I spent a whole day looking for the simplest possible multithreaded URL fetcher in Python, but most of the scripts I found use queues, multiprocessing, or complex libraries.
Finally I wrote one myself, which I am posting as an answer. Please feel free to suggest any improvements.
I guess other people might have been looking for something similar.
Multithreading across multiple processor cores is truly parallel: separate cores execute instructions simultaneously, so several concurrent tasks genuinely happen at once rather than merely taking turns on one core.
Python is NOT a single-threaded language. You can create as many threads as you like, but because of the GIL only one of them executes Python bytecode at any given moment. Despite the GIL, libraries that perform computationally heavy tasks, like numpy, scipy, and pytorch, use C-based implementations under the hood that can release the GIL and make use of multiple cores.
Within a process or program, we can run multiple threads concurrently to improve performance. Unlike heavyweight processes, threads are lightweight and run inside a single process: they share its address space, its allocated resources, and its environment.
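As a tiny illustration of that shared address space (my own sketch, not from the original post), several threads can write into the same list with no message passing at all:

import threading

results = []  # one list, shared by every thread in the process

def worker(n):
    results.append(n * n)  # in CPython, list.append is atomic thanks to the GIL

threads = [threading.Thread(target=worker, args=(n,)) for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 4, 9, 16]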
Both multithreading and multiprocessing allow Python code to run concurrently, but only multiprocessing makes it truly parallel. However, if your code is I/O-bound (like HTTP requests), then multithreading will still probably speed it up significantly, as the timing sketch below illustrates.
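To make that concrete, here is a minimal sketch (mine, not part of the original question) that simulates I/O-bound work with time.sleep, which releases the GIL just as waiting on a socket does. The threaded run takes roughly the time of the slowest task instead of the sum of all of them:

import threading
import time

def io_task(seconds):
    time.sleep(seconds)  # sleeping releases the GIL, like waiting on a socket

delays = [1, 1, 1, 1, 1]

# Sequential: takes about sum(delays) = 5 seconds
start = time.time()
for d in delays:
    io_task(d)
print("sequential: %.1fs" % (time.time() - start))

# Threaded: takes about max(delays) = 1 second
start = time.time()
threads = [threading.Thread(target=io_task, args=(d,)) for d in delays]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("threaded: %.1fs" % (time.time() - start))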
Simplifying your original version as far as possible:
import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com",
        "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    urlHandler = urllib2.urlopen(url)
    html = urlHandler.read()
    print "'%s' fetched in %ss" % (url, (time.time() - start))

# Create one thread per URL, start them all, then wait for all of them to finish.
threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print "Elapsed Time: %s" % (time.time() - start)
The only new tricks here are:

- Keep track of each thread you create.
- Don't bother with a pool of threads if you just want to know when they are all done; join already tells you that.
- You don't need a Thread subclass, just a target function.

Alternatively, multiprocessing has a thread pool that doesn't start other processes:
#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
from time import time as timer
from urllib2 import urlopen

urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com",
        "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    try:
        response = urlopen(url)
        return url, response.read(), None
    except Exception as e:
        return url, None, e

start = timer()
# Map fetch_url over the URLs with at most 20 worker threads;
# imap_unordered yields each result as soon as it is ready.
results = ThreadPool(20).imap_unordered(fetch_url, urls)
for url, html, error in results:
    if error is None:
        print("%r fetched in %ss" % (url, timer() - start))
    else:
        print("error fetching %r: %s" % (url, error))
print("Elapsed Time: %s" % (timer() - start,))
The advantages compared to the Thread-based solution:

- ThreadPool lets you limit the maximum number of concurrent connections (20 in the code example).
- The same code also runs on Python 3 (from urllib.request import urlopen on Python 3).
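If you are on Python 3 anyway, the standard-library concurrent.futures module gives you the same pattern. Here is a minimal sketch of that variant (my addition, not part of the answers above, reusing the same URL list):

from concurrent.futures import ThreadPoolExecutor, as_completed
from time import time as timer
from urllib.request import urlopen

urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com",
        "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    with urlopen(url) as response:
        return url, response.read()

start = timer()
# max_workers plays the same role as ThreadPool(20) in the example above
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(fetch_url, url) for url in urls]
    for future in as_completed(futures):  # yields each future as it finishes
        try:
            url, html = future.result()  # re-raises any exception from fetch_url
            print("%r fetched in %ss" % (url, timer() - start))
        except Exception as e:
            print("error while fetching: %s" % e)
print("Elapsed Time: %s" % (timer() - start))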