
PyPI download counts seem unrealistic

I put a package on PyPI for the first time ~2 months ago, and have made some version updates since then. This week I noticed the download counts being recorded, and was surprised to see the package had been downloaded hundreds of times. Over the next few days, I was more surprised to see the download count increasing by sometimes hundreds per day, even though this is a niche statistical test toolbox. In particular, older versions of the package are continuing to be downloaded, sometimes at higher rates than the newest version.

What is going on here?

Is there a bug in PyPI's download counting, or is there an abundance of crawlers grabbing open source code (as mine is)?

asked Mar 10 '12 by jeffalstott




4 Answers

This is kind of an old question at this point, but I noticed the same thing about a package I have on PyPI and investigated further. It turns out PyPI keeps reasonably detailed download statistics, including (apparently slightly anonymised) user agents. From that, it was apparent that most people downloading my package were things like "z3c.pypimirror/1.0.15.1" and "pep381client/1.5". (PEP 381 describes a mirroring infrastructure for PyPI.)

I wrote a quick script to tally everything up, first including all of them and then leaving out the most obvious bots, and it turns out that literally 99% of the download activity for my package was caused by mirrorbots: 14,335 downloads total, compared to only 146 downloads with the bots filtered. And that's just leaving out the very obvious ones, so it's probably still an overestimate.
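A quick tally script along those lines might look like the following. This is a sketch, not the answerer's actual script: it assumes the download log has been exported as rows with a `user_agent` field, and the bot-marker list is illustrative and non-exhaustive (based on the user agents named above, plus bandersnatch, a later PEP 381-style mirror client).

```python
from collections import Counter

# Substrings identifying well-known mirroring bots (illustrative list).
BOT_MARKERS = ("pypimirror", "pep381client", "bandersnatch")

def tally(rows):
    """Count downloads per user agent and split bot traffic from the rest."""
    agents = Counter(row["user_agent"] for row in rows)
    bots = sum(n for ua, n in agents.items()
               if any(m in ua.lower() for m in BOT_MARKERS))
    humans = sum(agents.values()) - bots
    return agents, bots, humans

# Example usage with an in-memory sample instead of a real log file:
sample = [{"user_agent": "pep381client/1.5"},
          {"user_agent": "z3c.pypimirror/1.0.15.1"},
          {"user_agent": "pip/1.3"}]
agents, bots, humans = tally(sample)
print(bots, humans)  # 2 bot downloads, 1 non-bot download
```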

It looks like the main reason PyPI needs mirrors is because it has them.

answered Oct 17 '22 by Cairnarvon


Starting with Cairnarvon's summarizing statement:

"It looks like the main reason PyPI needs mirrors is because it has them."

I would slightly modify this:

It may be more that the way PyPI actually works forces it to be mirrored, and that this mirroring contributes an additional bit (or two :-) to the real traffic.

At the moment I think you MUST interact with the main index to know what to update in your repository. State is not simply accessible through timestamps on some publicly accessible folder hierarchy. So, the bad thing is, rsync is out of the equation. The good thing is, you MAY talk to the index through JSON, OAuth, XML-RPC or HTTP interfaces.

For XML-RPC:

$> python3
>>> import xmlrpc.client  # named xmlrpclib in Python 2
>>> client = xmlrpc.client.ServerProxy('https://pypi.org/pypi')
>>> client.package_releases('PartitionSets')
['0.1.1']

For JSON eg.:

$> curl https://pypi.python.org/pypi/PartitionSets/0.1.1/json
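The JSON response can be processed with just the standard library. A sketch, using a trimmed hard-coded sample instead of a live request (the `info` and `releases` keys do match the structure PyPI's JSON API returns, but the sample here is abbreviated):

```python
import json

# A trimmed sample of the JSON returned by /pypi/<package>/<version>/json.
sample = json.loads("""
{
  "info": {"name": "PartitionSets", "version": "0.1.1"},
  "releases": {"0.1.1": []}
}
""")

# The releases mapping is what tells a mirror which versions exist upstream.
versions = sorted(sample["releases"])
print(sample["info"]["name"], versions)  # PartitionSets ['0.1.1']
```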

If there are approx. 30,000 packages hosted [1], with some being downloaded 50,000 to 300,000 times a week [2] (like distribute, pip, requests, paramiko, lxml, boto, redis and others), you really need mirrors, at least from an accessibility perspective. Just imagine what a user does when pip install NeedThisPackage fails: wait? Also, company-wide PyPI mirrors are quite common, acting as proxies for otherwise unroutable networks. Finally, not to forget the wonderful multi-version checking enabled through virtualenv and friends. These are all, IMO, legitimate and potentially wonderful uses of packages ...

In the end, you never know what an agent really does with a downloaded package: did N users really use it, or was it just overwritten on the next update ... and after all, IMHO package authors should care more about the number and nature of uses than the pure number of potential users ;-)


Refs: The guesstimated numbers are from https://pypi.python.org/pypi (29303 packages) and http://pypi-ranking.info/week (for the weekly numbers, requested 2013-03-23).

answered Oct 17 '22 by Dilettant


You also have to take into account that virtualenv is getting more popular. If your package is something like a core library that people use in many of their projects, they will usually download it multiple times.

Consider a single user with 5 projects that use your package, each living in its own virtualenv. Using pip to meet the requirements, your package is already downloaded 5 times this way. Then these projects might be set up on different machines, like work, home and laptop computers; in addition, there might be a staging and a live server in the case of a web application. Summing this up, you end up with many downloads by a single person.
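The multiplication described above is easy to make concrete. The counts below are the hypothetical ones from the paragraph, not measured data:

```python
projects = 5   # virtualenvs on one machine, each with its own copy
machines = 3   # work, home and laptop computers
servers  = 2   # staging + live server, for a web application

# Every environment runs its own `pip install`, so one person can
# plausibly account for this many downloads of a single package:
downloads = projects * machines + servers
print(downloads)  # 17
```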

Just a thought... perhaps your package is simply good. ;)

answered Oct 17 '22 by Dirk Eschler


Hypothesis: CI tools like Travis CI and AppVeyor also contribute quite a bit. Each commit/push can trigger a build of the package and a fresh installation of everything in requirements.txt.
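A hypothetical Travis CI configuration illustrates the effect: every push spins up a fresh virtual machine, which re-downloads every pinned dependency from PyPI before the tests run, so an active project generates PyPI downloads on each build.

```yaml
# Hypothetical .travis.yml for a project depending on your package.
language: python
python:
  - "2.7"
install:
  - pip install -r requirements.txt   # one PyPI download per dependency, per build
script:
  - python -m unittest discover
```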

answered Oct 17 '22 by Albert-Jan