I am opening a file which has 100,000 URLs. I need to send an HTTP request to each URL and print the status code. I am using Python 2.6, and so far I have looked at the many confusing ways Python implements threading/concurrency. I have even looked at the Python concurrence library, but cannot figure out how to write this program correctly. Has anyone come across a similar problem? I guess generally I need to know how to perform thousands of tasks in Python as fast as possible - I suppose that means 'concurrently'.
The built-in concurrent library

Threads in Python have more to do with concurrency than with parallelism. We'll use the requests library for sending the HTTP requests and the concurrent.futures module from the standard library for executing them concurrently.
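For illustration, a minimal sketch of that approach (the sample URLs, the 5-second timeout, and the pool size of 10 are placeholder values, not taken from the original):

import concurrent.futures
import requests

# Placeholder URLs; in practice these would come from your own list or file.
urls = ['http://example.com', 'http://example.org']

def fetch_status(url):
    # HEAD keeps the response small; we only care about the status code.
    return url, requests.head(url, timeout=5).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    for url, status in executor.map(fetch_status, urls):
        print(status, url)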
Performance with the current configuration

With the current configuration the server can handle 10 requests per second. If the handler is changed to time.sleep(0.5), i.e. if each request responds in approximately 0.5 seconds, the server would be able to handle 20 requests per second: with a fixed pool of worker threads, the maximum throughput is roughly the number of workers divided by the time each request takes.
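As a back-of-the-envelope check of that arithmetic (the pool size of 10 workers and the 1-second handler are assumptions, inferred only from the 10 requests-per-second figure above):

# Rough throughput model for a fixed pool of worker threads.
# The numbers below are illustrative assumptions, not measurements.
def max_requests_per_second(workers, seconds_per_request):
    return workers / seconds_per_request

print(max_requests_per_second(10, 1.0))  # 10.0 -> matches the 10 req/s above
print(max_requests_per_second(10, 0.5))  # 20.0 -> matches the 20 req/s above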
Twistedless solution:
from urlparse import urlparse
from threading import Thread
import httplib, sys
from Queue import Queue

concurrent = 200

def doWork():
    # Each worker thread pulls URLs off the queue until the program exits.
    while True:
        url = q.get()
        status, url = getStatus(url)
        doSomethingWithResult(status, url)
        q.task_done()

def getStatus(ourl):
    # Send a HEAD request so we only fetch headers, not the body.
    try:
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(url.netloc)
        conn.request("HEAD", url.path)
        res = conn.getresponse()
        return res.status, ourl
    except:
        return "error", ourl

def doSomethingWithResult(status, url):
    print status, url

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
This one is slightly faster than the Twisted solution and uses less CPU.
Things have changed quite a bit since 2010, when this was posted. I haven't tried all the other answers, but I have tried a few, and I found the approach below to work best for me using Python 3.6.
I was able to fetch roughly 150 unique domains per second running on AWS.
import concurrent.futures
import requests
import time

out = []
CONNECTIONS = 100
TIMEOUT = 5

tlds = open('../data/sample_1k.txt').read().splitlines()
urls = ['http://{}'.format(x) for x in tlds[1:]]

def load_url(url, timeout):
    # HEAD request: we only need the status code, not the body.
    ans = requests.head(url, timeout=timeout)
    return ans.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    future_to_url = (executor.submit(load_url, url, TIMEOUT) for url in urls)
    time1 = time.time()
    # Process results as they complete, printing a running count.
    for future in concurrent.futures.as_completed(future_to_url):
        try:
            data = future.result()
        except Exception as exc:
            data = str(type(exc))
        finally:
            out.append(data)
            print(str(len(out)), end="\r")
    time2 = time.time()

print(f'Took {time2-time1:.2f} s')
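To apply the same pattern to the original question (100,000 URLs read from a file, printing each status code), a sketch along these lines should work. The filename urllist.txt and the one-URL-per-line format are carried over from the earlier answer; the pool size and timeout are guesses you would tune for your network.

import concurrent.futures
import requests

# Assumes urllist.txt holds one URL per line, as in the earlier answer.
with open('urllist.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

def fetch_status(url):
    try:
        return url, requests.head(url, timeout=5).status_code
    except requests.RequestException as exc:
        return url, type(exc).__name__

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    # executor.map yields results in input order; as_completed (above) yields
    # them as they finish. Either is fine for simply printing status codes.
    for url, status in executor.map(fetch_status, urls):
        print(status, url)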