I recently presented a lightning talk to some colleagues on how to port Python 2 code to Python 3. Writing code that supports Python 3 is especially important now that Python 2 reached its end of support on 1 January 2020. While writing the presentation I looked for statistics on how many Python packages now support Python 3, and how many of those have dropped support for Python 2. After scouring the internet I wasn't able to find anything concrete, so I thought I would answer the question for the rest of the world.
PyPi.org is the main Python package repository and is maintained by the Python Packaging Authority. There are many other public mirrors, and it's common for large organisations to host their own, but PyPi tends to be the source of truth when it comes to Python libraries, so it is likely the best place to get an answer to my question.
The most prominent way for a package to declare which Python versions it supports is via classifier strings. However, after inspecting the source code of pip, I found that no attention is paid to these values; they appear to be purely cosmetic. Other ways include the supported Python version in the filename of a `.whl` file (set by package metadata) and an under-utilised `requires_python` field (set through the `python_requires` argument) that can be defined in the `setup.py` of a package.
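
To make these three signals concrete, here is a minimal sketch that pulls them out of a single package's metadata. The dictionary layout (an `info` mapping holding `classifiers` and `requires_python`, plus a `urls` list of files with a `filename`) is my assumption about the shape of the JSON API response, and `version_signals` is a hypothetical helper name.

```python
# Assumed metadata shape: {"info": {...}, "urls": [{"filename": ...}, ...]}
PYTHON_BRANCH = "Programming Language :: Python ::"


def version_signals(metadata):
    """Collect the three places a package can declare Python version support."""
    info = metadata.get("info") or {}

    # 1. Classifier strings under the Python language branch.
    classifiers = [
        c for c in (info.get("classifiers") or []) if c.startswith(PYTHON_BRANCH)
    ]

    # 2. The Python tag in wheel filenames:
    #    name-version(-build)-pythontag-abitag-platformtag.whl
    wheel_tags = set()
    for file_entry in metadata.get("urls") or []:
        filename = file_entry.get("filename", "")
        if filename.endswith(".whl"):
            parts = filename[: -len(".whl")].split("-")
            if len(parts) >= 5:
                wheel_tags.add(parts[-3])

    # 3. The free-text requires_python field (may be missing or None).
    return {
        "classifiers": classifiers,
        "wheel_python_tags": sorted(wheel_tags),
        "requires_python": info.get("requires_python"),
    }
```

Note that the first list also picks up classifiers such as `Programming Language :: Python :: Implementation :: CPython`, which say nothing about the Python version and need filtering out before any counting.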
The problem I encountered is that PyPi provides a very limited API. There are three types: simple, JSON and XMLRPC, none of which provide a flexible querying language beyond classifiers. Getting an answer to my question was going to require digging a bit deeper, so I decided to go with the JSON API and download the metadata for every package into an SQLite database that I can then query. After a number of hours coding I arrived at this script. Two hours, a large number of API requests and a 300MB database file later, the script had finished running.
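
For readers who want a feel for the approach, the following is a minimal sketch of the download-and-store idea and not the script linked above. The URL pattern is the per-project JSON endpoint; the table layout, the column names and the helper names are all assumptions made for illustration.

```python
import json
import sqlite3
import urllib.request

# Per-project JSON endpoint; {name} is the package name.
JSON_URL = "https://pypi.org/pypi/{name}/json"


def fetch_metadata(name):
    """Download and decode the JSON metadata for one package (network call)."""
    with urllib.request.urlopen(JSON_URL.format(name=name)) as response:
        return json.load(response)


def create_schema(conn):
    """Create a hypothetical one-row-per-package table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS packages ("
        "name TEXT PRIMARY KEY, author TEXT, latest_upload TEXT, "
        "metadata TEXT NOT NULL)"
    )


def store_metadata(conn, name, metadata):
    """Store the raw JSON plus a couple of columns that are handy to query."""
    info = metadata.get("info") or {}
    uploads = [
        f["upload_time"] for f in (metadata.get("urls") or []) if f.get("upload_time")
    ]
    conn.execute(
        "INSERT OR REPLACE INTO packages VALUES (?, ?, ?, ?)",
        (
            name,
            info.get("author"),
            max(uploads) if uploads else None,  # ISO strings sort chronologically
            json.dumps(metadata),
        ),
    )
```

Keeping the full JSON in a text column alongside a few extracted columns means new questions can be answered later without downloading everything again.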
At the time of this experiment there were 227,069 packages published to PyPi. Alongside well-maintained and actively developed packages, PyPi has its fair share of stale projects and other junk. For this exercise, any package that had not had a release in the last four years was considered stale and unmaintained, and was excluded from the results below. This brings the total number of packages down to 179,171.
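
The staleness rule can be expressed as a small predicate. This is a sketch under two assumptions: that the latest upload time is available as an ISO 8601 string, and that "four years" is approximated as 4 × 365 days. The function name `is_active` is hypothetical.

```python
from datetime import datetime, timedelta


def is_active(latest_upload, now, max_age_days=4 * 365):
    """Return True if the latest release is recent enough to count as active.

    latest_upload: ISO 8601 timestamp string, or None if nothing was uploaded.
    now: the reference datetime for the experiment.
    """
    if not latest_upload:
        return False  # no release at all is treated as stale
    age = now - datetime.fromisoformat(latest_upload)
    return age <= timedelta(days=max_age_days)
```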
Package Versioning
As mentioned above, classifier strings are the most common way to denote which Python versions a package supports, so I started there. It's worth mentioning that of the active packages on PyPi, 49,193 (27%) have no classifiers at all. Querying the classifiers yielded the following results:
Filter Description | No. of Packages | Percentage of total active packages |
Packages with no Python version classifiers | 70,512 | 39% |
Packages with a Python 2 classifier | 39,731 | 22% |
Packages with a Python 3 classifier | 100,273 | 56% |
Packages with Python 2 & 3 classifiers | 31,345 | 17% |
Packages with Python 2 but not Python 3 classifiers | 8,386 | 5% |
Packages with Python 3 but not Python 2 classifiers | 68,928 | 38% |
Beyond the 27% of packages that do not use classifiers, a further 21,319 packages have no classifiers that relate to Python version support, bringing the total to 70,512, or 39% of all active Python packages on PyPi. From the data above we can see that the industry has already made a shift towards Python 3, with at least 56% of packages claiming support for it. 17% declare support for both Python 2 and Python 3, while 38% support Python 3 exclusively.
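
The buckets in the table above come down to two flags per package. Here is a minimal sketch of that logic; the label strings and the function name are hypothetical, and it assumes version classifiers take the form `Programming Language :: Python :: 2.7`, `Programming Language :: Python :: 3 :: Only` and so on.

```python
PREFIX = "Programming Language :: Python :: "


def classify_by_classifiers(classifiers):
    """Bucket a package by the Python major versions its classifiers mention."""
    supports_2 = supports_3 = False
    for classifier in classifiers:
        if not classifier.startswith(PREFIX):
            continue
        tail = classifier[len(PREFIX):]
        # "2", "2.7", "3", "3.8", "3 :: Only" all start with the major version;
        # "Implementation :: CPython" matches neither and is ignored.
        if tail.startswith("2"):
            supports_2 = True
        elif tail.startswith("3"):
            supports_3 = True

    if supports_2 and supports_3:
        return "2 and 3"
    if supports_2:
        return "2 only"
    if supports_3:
        return "3 only"
    return "no version classifier"
```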
The next step was to take the 70,512 packages that could not be classified by the method above and look for the `python_version` in the metadata of the package's latest release. This field is populated to indicate the Python version support for all package formats (e.g. `.whl`, `.msi`, `.egg`, `.exe`, `.dmg` files), except for source code distribution files (`sdist`). This accounts for a further 29,160 packages.
Filter Description | No. of Packages | Percentage of remaining data set |
Packages where the latest release supports Python 2 | 12,089 | 17% |
Packages where the latest release supports Python 3 | 23,413 | 33% |
Packages where the latest release supports Python 2 & 3 | 6,342 | 8% |
Packages where the latest release supports Python 2 but not 3 | 6,708 | 9% |
Packages where the latest release supports Python 3 but not 2 | 18,052 | 25% |
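
Reading the `python_version` field takes a little normalisation, because it can hold plain versions as well as wheel-style tags. The sketch below assumes values such as `2.7`, `3.6`, `py3`, `py2.py3`, `cp36`, `any` and `source`, and assumes each file entry carries `packagetype` and `python_version` keys; the function names are hypothetical.

```python
import re

# Wheel-style tags: an interpreter prefix followed by the version digits.
TAG = re.compile(r"^(?:py|cp|pp|ip|jy)(\d)")


def majors_from_python_version(value):
    """Return the set of Python major versions (2 and/or 3) a value implies."""
    value = (value or "").strip().lower()
    if re.fullmatch(r"\d+(\.\d+)*", value):
        # Plain version such as "2.7" or "3.6": the first digit is the major.
        majors = {int(value[0])}
    else:
        # Dotted tag set such as "py2.py3"; "any" and "source" match nothing.
        majors = set()
        for tag in value.split("."):
            match = TAG.match(tag)
            if match:
                majors.add(int(match.group(1)))
    return majors & {2, 3}


def release_majors(files):
    """Combine the majors of every non-sdist file in a release."""
    majors = set()
    for file_entry in files:
        if file_entry.get("packagetype") == "sdist":
            continue  # source distributions carry no usable python_version
        majors |= majors_from_python_version(file_entry.get("python_version"))
    return majors
```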
This then leaves 41,352 packages that were not classified. 1,418 of these populated the `requires_python` field. However, this is a free-text field and is error-prone to parse. For such a small number of packages in a large data set, I decided not to use it as a classification method.
The final numbers:
Filter Description | No. of Packages | Percentage of total active packages |
Packages that were not classified | 41,352 | 23% |
Packages supporting Python 2 | 51,820 | 28% |
Packages supporting Python 3 | 123,686 | 69% |
Packages supporting Python 2 & 3 | 37,687 | 21% |
Packages supporting Python 2 but not 3 | 15,094 | 8% |
Packages supporting Python 3 but not 2 | 86,980 | 48% |
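
Each row of the final table is the sum of the matching rows in the two earlier tables, since the second pass only looked at packages the first pass could not classify. The short check below uses only figures quoted in this post; the dictionary keys are my own labels.

```python
ACTIVE_PACKAGES = 179171

# Counts from the classifier pass.
by_classifier = {
    "py2": 39731, "py3": 100273, "both": 31345,
    "py2_only": 8386, "py3_only": 68928,
}

# Counts from the latest-release python_version pass.
by_release = {
    "py2": 12089, "py3": 23413, "both": 6342,
    "py2_only": 6708, "py3_only": 18052,
}

# Row-wise sum of the two passes.
final = {key: by_classifier[key] + by_release[key] for key in by_classifier}

# 70,512 packages had no version classifier; 29,160 of them were picked up
# by the second pass, leaving the rest unclassified.
unclassified = 70512 - 29160
classified = ACTIVE_PACKAGES - unclassified

for key, count in final.items():
    print(f"{key}: {count:,}")
print(f"unclassified: {unclassified:,}, classified: {classified:,}")
```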
Other Insights
Downloading the metadata into a format that can be queried presented an opportunity to find other insights into the Python world. Below are a few more insights drawn from the data obtained from PyPi.
Actively developed packages
Number of packages that have had a release in the last:
Release in the last: | No. of packages: |
1 Month: | 18,920 |
3 Months: | 38,872 |
6 Months: | 58,503 |
1 Year: | 87,458 |
2 Years: | 127,599 |
5 Years: | 185,485 |
Licensing
As the `license` field in a `setup.py` is free-text, the data showed a myriad of inconsistencies, spelling mistakes and formats that made it hard to quantify reliably. For a rudimentary view on the license distribution of packages I decided to go for a keyword-based approach.
License string contains the keyword: | No. of active packages: |
MIT | 71,839 |
GPL | 32,189 |
BSD | 22,472 |
Apache | 16,832 |
GNU | 5,310 |
Copyright | 1,654 |
same as | 79 |
Beyond the main licenses listed above I also came across a few humorous ones:
Do whatever you want, don't blame me
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE Version 2, December 2004
<open file 'LICENSE', mode 'r' at 0x7f911d29f5d0>
Buy snare a beer
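
The keyword-based license count described above can be sketched as a simple substring search. Whether the original queries were case-sensitive is not stated, so the case-insensitive matching here is an assumption, as is the function name.

```python
from collections import Counter

KEYWORDS = ("MIT", "GPL", "BSD", "Apache", "GNU", "Copyright", "same as")


def count_license_keywords(licenses, keywords=KEYWORDS):
    """Count how many license strings contain each keyword (case-insensitive)."""
    counts = Counter({keyword: 0 for keyword in keywords})
    for text in licenses:
        lowered = (text or "").lower()  # tolerate missing license fields
        for keyword in keywords:
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts
```

Substring matching is crude: a single license can land in several buckets (an "LGPL" string also counts under GPL), and a case-insensitive search for "MIT" will also match words such as "limited", so the figures are best read as a rough guide.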
Top Authors
The `author` is a textual field and the table below does not take into account inconsistencies in spelling and format, but here are the top 15 people/organisations that author the most packages published to PyPi:
Author Name: | No. of Packages: |
The Guardians | 1,140 |
Tecnativa, Odoo Community Association (OCA) | 617 |
OpenStack | 472 |
hfpython | 465 |
Paul Sokolovsky | 462 |
Arno-Can Uestuensoez | 441 |
Microsoft Corporation | 419 |
Eficent, Odoo Community Association (OCA) | 342 |
Akretion,Odoo Community Association (OCA) | 318 |
Amazon Web Services | 286 |
Camptocamp,Odoo Community Association (OCA) | 246 |
Aliyun | 231 |
Vlad Emelianov | 223 |
Adafruit Industries | 215 |
MicroPython Developers | 194 |
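
A ranking like the one above is a single grouped query once the metadata is in SQLite. The sketch below assumes a hypothetical `packages` table with one row per package and an `author` column; the real schema may differ.

```python
import sqlite3


def top_authors(conn, limit=15):
    """Return (author, package_count) pairs, most prolific first."""
    query = """
        SELECT author, COUNT(*) AS package_count
        FROM packages
        WHERE author IS NOT NULL AND TRIM(author) != ''
        GROUP BY author
        ORDER BY package_count DESC, author ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()
```

Because the grouping is on the raw string, "Akretion,Odoo Community Association (OCA)" and "Akretion, Odoo Community Association (OCA)" would be counted as different authors, which is the inconsistency noted above.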
Conclusion
To answer the initial question, we were able to classify the supported Python version for a total of 137,819 active Python packages on PyPi (77%). Measured against all active packages, 123,686 (69%) supported Python 3, and 37,687 (21%) supported both Python versions. I expect Python 2's recent end of support will trigger a larger shift to Python 3 and the results will look very different one year from now.
This proved to be a very long-winded project for a question that I initially thought would be trivial to answer. However, it has shed some light on the secret world of Python packages and will allow for the tracking of trends in the Python industry going forward. I have productionised the code into a Python package for anyone else who wishes to run their own queries; you can find it here. I have also uploaded the data and queries used in this blog here.