Parallel fetching of files

To download files, I'm opening the URL with urllib2.urlopen and reading the response in chunks.

I would like to connect to the server several times and download the file in six different sessions. That way, the download should be faster. Many download managers have this feature.

I thought about specifying the part of the file I would like to download in each session and somehow processing all the sessions at the same time. I'm not sure how to achieve this.
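
Roughly, the idea is something like this sketch (illustration only; the helper names are made up, and it assumes the server honours Range requests and that the total file size is known in advance):

    import threading
    import urllib2

    def fetch_part(url, start, end, index, parts):
        # Ask the server for just the bytes start..end of the file.
        req = urllib2.Request(url)
        req.add_header('Range', 'bytes=%d-%d' % (start, end))
        parts[index] = urllib2.urlopen(req).read()

    def parallel_download(url, total_size, sessions=6):
        # Split the file into equal ranges and fetch each one in its own thread.
        chunk = total_size // sessions
        parts = [None] * sessions
        threads = []
        for i in range(sessions):
            start = i * chunk
            end = total_size - 1 if i == sessions - 1 else start + chunk - 1
            t = threading.Thread(target=fetch_part, args=(url, start, end, i, parts))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()
        return ''.join(parts)
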

asked Jan 25 '12 by Alex Bazuvul
3 Answers

For running parallel requests, you might want to use urllib3 or requests.
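
For instance, a quick sketch of the idea with requests and a plain thread pool (the URLs are just placeholders):

    import requests
    from multiprocessing.dummy import Pool  # a thread pool, despite the module name

    urls = ['http://example.com/part1', 'http://example.com/part2']  # placeholders

    def fetch(url):
        return requests.get(url).content

    pool = Pool(4)
    bodies = pool.map(fetch, urls)  # runs the downloads concurrently
    pool.close()
    pool.join()
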

I took some time to make a list of similar questions:

Looking for [python] +download +concurrent gives these interesting ones:

  • Concurrent downloads - Python
  • What is the fastest way to send 100,000 HTTP requests in Python?
  • Library or tool to download multiple files in parallel
  • Download multiple pages concurrently?
  • Python: simple async download of url content?
  • Python, gevent, urllib2.urlopen.read(), download accelerator
  • Python/Urllib2/Threading: Single download thread faster than multiple download threads. Why?
  • Scraping landing pages of a list of domains
  • A clean, lightweight alternative to Python's twisted?

Looking for [python] +http +concurrent gives these:

  • Python: How to make multiple HTTP POST queries in one moment?
  • Multi threaded web scraper using urlretrieve on a cookie-enabled site

Looking for [python] +urllib2 +slow:

  • Python urllib2.open is slow, need a better way to read several urls
  • Python 2.6: parallel parsing with urllib2
  • How can I speed up fetching pages with urllib2 in python?
  • Threading HTTP requests (with proxies)

Looking for [python] +download +many:

  • Python, multi-threads, fetch webpages, download webpages
  • Downloading files in twisted using queue
  • Python: Something like map that works on threads
  • Rotating Proxies for web scraping
  • Anyone know of a good Python based web crawler that I could use?
answered Nov 14 '22 by Piotr Dobrogost

Sounds like you want to use one of the flavors of HTTP Range that are available.

Edit: updated the link to point to the copy of the RFC hosted at w3.org.
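
As a rough illustration (the URL is a placeholder), checking for byte-range support and fetching a partial range could look like this:

    import urllib2

    url = 'http://example.com/big.file'  # placeholder

    # Servers that support partial downloads usually advertise it.
    print urllib2.urlopen(url).info().getheader('Accept-Ranges')  # 'bytes'

    # Ask for the first 100 bytes only.
    req = urllib2.Request(url)
    req.add_header('Range', 'bytes=0-99')
    resp = urllib2.urlopen(req)
    print resp.getcode()    # 206 Partial Content when the range is honoured
    print len(resp.read())  # 100
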

answered Nov 14 '22 by synthesizerpatel

As we've been discussing, I made one using PycURL.

The one and only thing I had to do was call pycurl_instance.setopt(pycurl_instance.NOSIGNAL, 1) to prevent crashes.

I used APScheduler to fire the requests in separate threads. Thanks to your advice to change the busy-waiting while True: pass to while True: time.sleep(3) in the main thread, the code behaves quite nicely, and with the Runner module from the python-daemon package the application is almost ready to be used as a typical UN*X daemon.
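
For reference, a single PycURL transfer for one byte range could look roughly like this (details other than the NOSIGNAL option are assumed):

    import pycurl
    from StringIO import StringIO

    def fetch_range(url, byte_range):
        # Download one byte range of the file, e.g. byte_range = '0-1048575'.
        buf = StringIO()
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.RANGE, byte_range)
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        c.setopt(pycurl.NOSIGNAL, 1)  # avoids signal-related crashes when used from threads
        c.perform()
        c.close()
        return buf.getvalue()
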

answered Nov 14 '22 by Kacper Perschke