I'm looking for a python library or a command line tool for downloading multiple files in parallel. My current solution is to download the files sequentially which is slow. I know you can easily write a half-assed threaded solution in python, but I always run into annoying problem when using threading. It is for polling a large number of xml feeds from websites.
My requirements for the solution are:
Please don't suggest how I may go about implementing the above requirements. I'm looking for a ready-made, battle-tested solution.
I guess I should describe what I want it for too... I have about 300 different data feeds as xml formatted files served from 50 data providers. Each file is between 100kb and 5mb in size. I need to poll them frequently (as in once every few minutes) to determine if any of them has new data I need to process. So it is important that the downloader uses http caching to minimize the amount of data to fetch. It also uses gzip compression obviously.
Then the big problem is how to use the bandwidth in an as efficient manner as possible without overstepping any boundaries. For example, one data provider may consider it abuse if you open 20 simultaneous connections to their data feeds. Instead it may be better to use one or two connections that are reused for multiple files. Or your own connection may be limited in strange ways.. My isp limits the number of dns lookups you can do so some kind of dns caching would be nice.
Download multiple files in parallel with Python To start, create a function ( download_parallel ) to handle the parallel download. The function ( download_parallel ) will take one argument, an iterable containing URLs and associated filenames (the inputs variable we created earlier).
As indicated, you can draw a rectangle around the links you want to select. This will highlight the links in yellow. From there you can either hit Enter to open the selected links in the same window, “Shift + Enter” to open in a new window, or “Alt + Enter” to download them.
In order to offer several files together as a download, you only have the option of compressing them in an archive. In my example, all files that match the specified pattern are listed and compressed in a zip archive. This is written to the memory and sent by the server.
In the attachments section with multiple files. Have a checkbox next to each file so you can select multiple files then click on a button to download the selected files. This way if you want to download multiple files you do not have to view each one.
You can try pycurl, though the interface is not easy at first, but once you look at examples, its not hard to understand. I have used it to fetch 1000s of web pages in parallel on meagre linux box.
The only problem is that it provides a basic infrastructure (basically just a python layer above the excellent curl library). You will have to write few lines to achieve the features as you want.
There are lots of options but it will be hard to find one which fits all your needs.
In your case, try this approach:
Use another thread to collect the results (i.e. another queue). When the number of result objects == number of puts in the first queue, then you're finished.
Make sure that all communication goes via the queue or the "config object". Avoid accessing data structures which are shared between threads. This should save you 99% of the problems.
I don't think such a complete library exists, so you'll probably have to write your own. I suggest taking a look at gevent for this task. They even provide a concurrent_download.py example script. Then you can use urllib2 for most of the other requirements, such as handling HTTP status codes, and displaying download progress.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With