
A very simple multithreading parallel URL fetching (without queue)

I spent a whole day looking for the simplest possible multithreaded URL fetcher in Python, but most scripts I found use queues, multiprocessing, or complex libraries.

Finally I wrote one myself, which I am reporting as an answer. Please feel free to suggest any improvement.

I guess other people might have been looking for something similar.

asked Apr 23 '13 by Daniele B

People also ask

Is multithreading truly parallel?

Multithreading on multiple processor cores is truly parallel: the individual cores work on different threads at the same time, so multiple tasks genuinely execute in parallel rather than merely being interleaved.

Is Python single threaded or multithreaded?

Python is NOT a single-threaded language, but because of the GIL only one thread in a process executes Python bytecode at a time. Despite the GIL, libraries that perform computationally heavy tasks, such as numpy, scipy and pytorch, use C-based implementations under the hood that release the GIL, allowing the use of multiple cores.
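
As a rough illustration of that last point (not part of the original question), numpy's matrix routines run in compiled code that releases the GIL, so two threads doing large matrix products can keep more than one core busy, while an equivalent pure-Python loop could not:

import threading
import numpy as np

def heavy_task():
    # The matrix product runs in numpy's C/BLAS code, which releases the GIL,
    # so the two threads can execute on separate cores at the same time.
    a = np.random.rand(2000, 2000)
    a.dot(a)

threads = [threading.Thread(target=heavy_task) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()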

Can multiple threads run concurrently?

Within a process or program, we can run multiple threads concurrently to improve performance. Unlike heavyweight processes, threads are lightweight and run inside a single process: they share the same address space, the allocated resources and the environment of that process.
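
A minimal sketch (not from the question) of what sharing an address space means in practice: every thread sees the same results list, and a lock guards the concurrent appends:

import threading

results = []              # one list, shared by every thread in the process
lock = threading.Lock()

def worker(n):
    with lock:            # guard the shared list against concurrent appends
        results.append(n * n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)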

Is it a good idea to use multi thread to speed your Python code?

Both multithreading and multiprocessing allow Python code to run concurrently. Only multiprocessing will allow your code to be truly parallel. However, if your code is IO-heavy (like HTTP requests), then multithreading will still probably speed up your code.


2 Answers

Simplifying your original version as far as possible:

import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com",
        "http://www.microsoft.com", "http://www.amazon.com",
        "http://www.facebook.com"]

def fetch_url(url):
    urlHandler = urllib2.urlopen(url)
    html = urlHandler.read()
    print "'%s' fetched in %ss" % (url, (time.time() - start))

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print "Elapsed Time: %s" % (time.time() - start)

The only new tricks here are:

  • Keep track of the threads you create.
  • Don't bother with a counter of threads if you just want to know when they're all done; join already tells you that.
  • If you don't need any state or external API, you don't need a Thread subclass, just a target function (see the sketch just below this list).
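
As a rough sketch of that last point (the subclass below is only an illustration, not code from the question or answers), both styles do the same work, but the target-function version needs far less boilerplate:

import threading

# Subclass style: extra boilerplate for no benefit when there is no state to keep.
class Fetcher(threading.Thread):
    def __init__(self, url):
        super(Fetcher, self).__init__()
        self.url = url

    def run(self):
        print("fetching %s" % self.url)

# Target-function style: the same behaviour with a plain function.
def fetch(url):
    print("fetching %s" % url)

thread = threading.Thread(target=fetch, args=("http://www.google.com",))
thread.start()
thread.join()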
answered Oct 14 '22 by abarnert


multiprocessing has a thread pool that doesn't start other processes:

#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
from time import time as timer
from urllib2 import urlopen

urls = ["http://www.google.com", "http://www.apple.com",
        "http://www.microsoft.com", "http://www.amazon.com",
        "http://www.facebook.com"]

def fetch_url(url):
    try:
        response = urlopen(url)
        return url, response.read(), None
    except Exception as e:
        return url, None, e

start = timer()
results = ThreadPool(20).imap_unordered(fetch_url, urls)
for url, html, error in results:
    if error is None:
        print("%r fetched in %ss" % (url, timer() - start))
    else:
        print("error fetching %r: %s" % (url, error))
print("Elapsed Time: %s" % (timer() - start,))

The advantages compared to the Thread-based solution:

  • ThreadPool lets you limit the maximum number of concurrent connections (20 in the code example)
  • the output is not garbled, because all output goes through the main thread
  • errors are caught and reported instead of silently killing a worker thread
  • the code works on both Python 2 and 3 (on Python 3, swap the import for from urllib.request import urlopen; see the import snippet below)
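
For a single file that runs unchanged on Python 2 and 3, one common pattern (an assumption on my part, not shown in the answer) is to fall back on the import:

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2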
answered Oct 14 '22 by jfs