I downloaded the metadata for every package on PyPi and this is what I found out

I recently presented a lightning talk to some colleagues on how to port Python 2 code to Python 3. Writing code that supports Python 3 is especially important now that Python 2 has reached its end of support on 1 January 2020. While writing the presentation I looked for statistics on how many Python packages now support Python 3, and how many of those have dropped support for Python 2. After scouring the internet I wasn't able to find anything concrete, so I thought I would answer the question for the rest of the world.

PyPi.org is the main Python package repository and is maintained by the Python Packaging Authority. There are many other public mirrors and it's common for large organisations to host their own, but PyPi tends to be the source of truth when it comes to Python libraries, so it is likely the best place to get an answer to my question. The most prominent way to declare which Python versions a package supports is via classifier strings. However, after inspecting the source code of pip, I found that no attention is paid to these values, and they appear to be purely cosmetic. Other ways include the supported Python version in the filename of a .whl file (set by package metadata), and an under-utilised requires_python field, which is set through the python_requires argument in a package's setup.py.
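To make these three signals concrete, here is a minimal sketch of where each one sits in the JSON metadata for a single package. The sample dictionary is hand-written and only imitates the shape of a JSON API response (classifiers and requires_python under info, and a python_version entry on each file under urls); version_signals is a hypothetical helper name of my own, not part of any library.

```python
def version_signals(metadata):
    """Collect the three Python-version signals from one package's JSON metadata."""
    info = metadata.get("info") or {}
    files = metadata.get("urls") or []
    prefix = "Programming Language :: Python ::"
    return {
        # Trove classifiers that mention a Python version or implementation
        "classifiers": [
            c for c in info.get("classifiers") or [] if c.startswith(prefix)
        ],
        # Free-text version specifier, often missing
        "requires_python": info.get("requires_python"),
        # Per-file value such as "py2.py3", "cp37" or "source"
        "file_python_versions": sorted(
            {f["python_version"] for f in files if f.get("python_version")}
        ),
    }


# Hand-written sample that imitates the shape of a JSON API response.
sample = {
    "info": {
        "classifiers": [
            "License :: OSI Approved :: MIT License",
            "Programming Language :: Python :: 3",
            "Programming Language :: Python :: 3.7",
        ],
        "requires_python": ">=3.5",
    },
    "urls": [
        {"packagetype": "bdist_wheel", "python_version": "py3"},
        {"packagetype": "sdist", "python_version": "source"},
    ],
}

signals = version_signals(sample)
print(signals)
```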

The problem I encountered is that PyPi provides a very limited API. There are three types: simple, JSON and XML-RPC, none of which provides a flexible query language beyond classifiers. Getting an answer to my question was going to require digging a bit deeper, so I decided to go with the JSON API and download the metadata for every package into an SQLite database that I could then query. After a number of hours of coding I arrived at this. Two hours, a large number of API requests and a 300MB database file later, the script had finished running.
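For readers who want a feel for the approach, the following is a rough sketch of the download-and-store idea, not the actual script. The table layout and the function names (fetch_metadata, create_schema, save_metadata) are assumptions made for illustration. The example stores a hand-written record so that it runs without network access; in a real run each record would come from fetch_metadata.

```python
import json
import sqlite3
import urllib.request


def fetch_metadata(name):
    """Download one package's metadata from the JSON API (needs network access)."""
    url = f"https://pypi.org/pypi/{name}/json"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def create_schema(conn):
    # Hypothetical single-table layout: a few queryable columns plus the raw JSON.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS packages (
               name TEXT PRIMARY KEY,
               version TEXT,
               requires_python TEXT,
               raw_json TEXT NOT NULL
           )"""
    )


def save_metadata(conn, metadata):
    """Insert or replace the row for one package."""
    info = metadata["info"]
    conn.execute(
        "INSERT OR REPLACE INTO packages VALUES (?, ?, ?, ?)",
        (
            info["name"],
            info.get("version"),
            info.get("requires_python"),
            json.dumps(metadata),
        ),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the data
create_schema(conn)

# Offline stand-in for fetch_metadata("example-package")
save_metadata(
    conn,
    {"info": {"name": "example-package", "version": "1.0", "requires_python": ">=3.6"}},
)
print(conn.execute("SELECT name, version FROM packages").fetchall())
```

Keeping the raw JSON next to a handful of extracted columns is one possible design choice: it lets new questions be asked later without downloading everything again.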

At the time of this experiment there were 227,069 packages published to PyPi. Alongside well-maintained and actively developed packages, PyPi has its fair share of stale projects and other junk. For this exercise, any package that had not had a release in the last four years was considered stale and unmaintained, and was excluded from the results below. This brings the total number of packages down to 179,171.
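The staleness rule can be sketched as a small function. This is an illustration under assumptions: upload times are taken to be ISO-formatted strings, four years is approximated as 4 * 365 days, and the reference date in the example is arbitrary rather than the date the experiment was run.

```python
from datetime import datetime, timedelta


def is_active(upload_times, now, max_age_days=4 * 365):
    """Return True if the newest upload is no older than max_age_days."""
    if not upload_times:
        return False  # a package with no uploads at all is treated as stale
    latest = max(datetime.fromisoformat(t) for t in upload_times)
    return now - latest <= timedelta(days=max_age_days)


now = datetime(2020, 2, 1)  # arbitrary reference date for the example
print(is_active(["2015-03-01T12:00:00", "2019-11-20T08:30:00"], now))  # True
print(is_active(["2014-06-30T09:15:00"], now))  # False
```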

Package Versioning

As mentioned above, classifier strings are the most common way to declare which Python versions a package supports, so I started there. It's worth mentioning that of the active packages on PyPi, 49,193 (27%) have no classifiers at all. Running various queries against the classifiers yielded the following results:

Filter description | No. of packages | Percentage of total active packages
Packages with no Python version classifiers | 70,512 | 39%
Packages with a Python 2 classifier | 39,731 | 22%
Packages with a Python 3 classifier | 100,273 | 56%
Packages with Python 2 & 3 classifiers | 31,345 | 17%
Packages with Python 2 but not Python 3 classifiers | 8,386 | 5%
Packages with Python 3 but not Python 2 classifiers | 68,928 | 38%

Beyond the 27% of packages that do not use classifiers at all, a further 21,319 packages have no classifiers that relate to Python version support, bringing the total to 70,512, or 39% of all active Python packages on PyPi. From the data above we can see that the industry has already made a shift towards Python 3, with at least 56% of packages claiming Python 3 support. 17% have kept Python 2 support alongside Python 3, while 38% support Python 3 exclusively.
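The bucketing behind these figures can be sketched as follows. This is my own minimal reconstruction of the idea rather than the queries that were actually run: the helper names and the four sample packages are made up, and only classifiers of the form "Programming Language :: Python :: 2..." or "... :: 3..." are treated as version signals.

```python
from collections import Counter


def python_majors(classifiers):
    """Return the set of major Python versions (2 and/or 3) named in classifiers."""
    majors = set()
    for classifier in classifiers:
        parts = [p.strip() for p in classifier.split("::")]
        if parts[:2] == ["Programming Language", "Python"] and len(parts) > 2:
            first = parts[2][:1]  # "3.7" -> "3", "Implementation" -> "I"
            if first in ("2", "3"):
                majors.add(int(first))
    return majors


def tally(packages):
    """Count packages per bucket; packages maps a name to its classifier list."""
    counts = Counter()
    for classifiers in packages.values():
        majors = python_majors(classifiers)
        if not majors:
            counts["no version classifier"] += 1
        if 2 in majors:
            counts["python 2"] += 1
        if 3 in majors:
            counts["python 3"] += 1
        if majors == {2, 3}:
            counts["python 2 & 3"] += 1
        if majors == {2}:
            counts["python 2 not 3"] += 1
        if majors == {3}:
            counts["python 3 not 2"] += 1
    return counts


# Made-up sample data
packages = {
    "legacy": ["Programming Language :: Python :: 2.7"],
    "modern": [
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3 :: Only",
    ],
    "both": [
        "Programming Language :: Python :: 2.7",
        "Programming Language :: Python :: 3.6",
    ],
    "untagged": [
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: Implementation :: CPython",
    ],
}

counts = tally(packages)
print(dict(counts))
```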

The next step was to take the 70,512 packages that could not be classified by the method above and look for the python_version in the metadata of the package's latest release. This field is populated to indicate the Python version support for all package formats (e.g. .whl, .msi, .egg, .exe, .dmg files), except for source code distribution files (sdist). This accounts for a further 29,160 packages.

Filter description | No. of packages | Percentage of remaining data set
Packages where the latest release supports Python 2 | 12,089 | 17%
Packages where the latest release supports Python 3 | 23,413 | 33%
Packages where the latest release supports Python 2 & 3 | 6,342 | 8%
Packages where the latest release supports Python 2 but not 3 | 6,708 | 9%
Packages where the latest release supports Python 3 but not 2 | 18,052 | 25%
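A sketch of how a python_version value might be mapped to major versions is shown below. The value shapes it handles (tags such as py2.py3 or cp37, plain numbers such as 2.7, and the word source for sdists) are my assumptions about what the field can contain, and majors_from_python_version is a hypothetical helper. Anything it does not recognise is left unclassified.

```python
import re

# Interpreter tags such as "py3", "cp27" or "pp36"
_TAGGED = re.compile(r"(?:py|cp|pp|ip|jy)([23])")
# Plain version numbers such as "3" or "2.7"
_PLAIN = re.compile(r"([23])(?:\.\d+)*$")


def majors_from_python_version(value):
    """Return the major versions implied by a file's python_version value."""
    value = (value or "").strip().lower()
    tagged = {int(m) for m in _TAGGED.findall(value)}
    if tagged:
        return tagged
    plain = _PLAIN.match(value)
    # "source", "any" and other unknown values fall through to the empty set
    return {int(plain.group(1))} if plain else set()


for example in ["py2.py3", "cp37", "2.7", "source"]:
    print(example, majors_from_python_version(example))
```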

This then leaves 41,352 packages that were not classified. 1,418 of these populated the requires_python field. However, this is a free-text field and parsing it is error-prone. Because it covers such a small share of a large data set, I decided not to use it as a classification method.

The final numbers:

Filter description | No. of packages | Percentage of total active packages
Packages that were not classified | 41,352 | 23%
Packages supporting Python 2 | 51,820 | 28%
Packages supporting Python 3 | 123,686 | 69%
Packages supporting Python 2 & 3 | 37,687 | 21%
Packages supporting Python 2 but not 3 | 15,094 | 8%
Packages supporting Python 3 but not 2 | 86,980 | 48%
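The final numbers are simply the sum of the two classification passes, which can be checked with a few lines of arithmetic. The sketch below takes the Python 2 & 3 row of the release-file table as 6,342 packages, and it truncates percentages to whole numbers, which happens to reproduce the percentages in the final table.

```python
ACTIVE = 179_171  # active packages after removing stale ones

by_classifier = {
    "python 2": 39_731,
    "python 3": 100_273,
    "python 2 & 3": 31_345,
    "python 2 not 3": 8_386,
    "python 3 not 2": 68_928,
}
by_release_file = {
    "python 2": 12_089,
    "python 3": 23_413,
    "python 2 & 3": 6_342,
    "python 2 not 3": 6_708,
    "python 3 not 2": 18_052,
}

# Add the two passes together, bucket by bucket
final = {label: by_classifier[label] + by_release_file[label] for label in by_classifier}
final["not classified"] = 41_352

for label, count in final.items():
    # int() truncates, e.g. 28.9 becomes 28
    print(f"{label}: {count:,} ({int(100 * count / ACTIVE)}%)")
```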

Other Insights

Downloading the metadata into a format that can be queried presented an opportunity to find other insights into the Python world. Below are a few more insights drawn from the data obtained from PyPi.

Actively developed packages

Number of packages that have had a release in the last:

Release in the last: No. of packages:
1 Month: 18,920
3 Months: 38,872
6 Months: 58,503
1 Year: 87,458
2 Years: 127,599
5 Years: 185,485

Licensing

As the license field in a setup.py is free text, the data showed a myriad of inconsistencies, spelling mistakes and formats that made it hard to quantify reliably. For a rudimentary view of the license distribution of packages, I decided to go for a keyword-based approach.

License string contains the keyword: No. of active packages:
MIT 71,839
GPL 32,189
BSD 22,472
Apache 16,832
GNU 5,310
Copyright 1,654
same as 79
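The keyword-based approach amounts to a case-insensitive substring count, roughly as sketched here. The function name and the sample license strings are invented for illustration. A single license string can be counted under several keywords (for example both GNU and GPL).

```python
def count_license_keywords(licenses, keywords):
    """Count how many license strings contain each keyword, ignoring case."""
    counts = {keyword: 0 for keyword in keywords}
    for text in licenses:
        lowered = (text or "").lower()  # a missing license counts as empty
        for keyword in keywords:
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts


# Invented sample license strings
licenses = [
    "MIT",
    "MIT License",
    "GNU GPL v3",
    "BSD-3-Clause",
    "Apache 2.0",
    None,
    "same as Python",
]
keywords = ["MIT", "GPL", "BSD", "Apache", "GNU", "Copyright", "same as"]

license_counts = count_license_keywords(licenses, keywords)
print(license_counts)
```

Substring matching is crude: a keyword like MIT also matches inside unrelated words such as "permitted", so counts obtained this way are best read as rough indications.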

Beyond the main licenses listed above I also came across a few humorous ones:
Do whatever you want, don't blame me
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE Version 2, December 2004
<open file 'LICENSE', mode 'r' at 0x7f911d29f5d0>
Buy snare a beer

Top Authors

The author is a textual field and the table below does not take into account inconsistencies in spelling and format, but here are the top 15 people/organisations that author the most packages published to PyPi:

Author Name: No. of Packages:
The Guardians 1,140
Tecnativa, Odoo Community Association (OCA) 617
OpenStack 472
hfpython 465
Paul Sokolovsky 462
Arno-Can Uestuensoez 441
Microsoft Corporation 419
Eficent, Odoo Community Association (OCA) 342
Akretion,Odoo Community Association (OCA) 318
Amazon Web Services 286
Camptocamp,Odoo Community Association (OCA) 246
Aliyun 231
Vlad Emelianov 223
Adafruit Industries 215
MicroPython Developers 194
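A ranking like this can be produced with an exact-match count over the author field, sketched below with invented sample data. Because only surrounding whitespace is stripped, spelling and formatting variants of the same author remain separate entries, which is the limitation mentioned above. top_authors is a hypothetical helper name.

```python
from collections import Counter


def top_authors(authors, n=15):
    """Return the n most frequent author strings with their package counts."""
    cleaned = (a.strip() for a in authors if a and a.strip())
    return Counter(cleaned).most_common(n)


# Invented sample of author fields, one entry per package
authors = [
    "OpenStack",
    "OpenStack",
    "Adafruit Industries",
    "",
    None,
    " OpenStack ",
    "Aliyun",
    "Adafruit Industries",
]

ranking = top_authors(authors, n=2)
print(ranking)
```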

Conclusion

To answer the initial question, we were able to classify the supported Python version for a total of 137,819 active Python packages on PyPi (77%). 123,686 packages (69% of all active packages) supported Python 3, and 37,687 (21%) supported both Python versions. I expect Python 2's recent end of support will trigger a larger shift to Python 3 and the results will look very different one year from now.

This proved to be a very long-winded project for a question that I initially thought would be trivial to answer. However, it has shed some light on the secret world of Python packages and will allow for the tracking of trends in the Python industry going forward. I have productionised the code into a Python package for anyone else who wishes to run their own queries; you can find it here. I have also uploaded the data and queries used in this blog here.
