I am opening a file which has 100,000 URLs. I need to send an HTTP request to each URL and print the status code. I am using Python 2.6, and so far I have looked at the many confusing ways Python implements threading/concurrency. I have even looked at the Python concurrence library, but cannot figure out how to write this program correctly. Has anyone come across a similar problem? I guess generally I need to know how to perform thousands of tasks in Python as fast as possible - I suppose that means 'concurrently'.
The built-in concurrent library

Threads in Python have more to do with concurrency than with parallelism. We'll use the requests library for sending the HTTP requests and the concurrent.futures module from the standard library for executing them concurrently.
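For illustration, a minimal sketch of that approach (the sample URLs, the 5-second timeout, and the pool size of 10 are placeholder values, not taken from the original):

import concurrent.futures
import requests

# Placeholder URLs; in practice these would come from your own list or file.
urls = ['http://example.com', 'http://example.org']

def fetch_status(url):
    # HEAD keeps the response small; we only care about the status code.
    return url, requests.head(url, timeout=5).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    for url, status in executor.map(fetch_status, urls):
        print(status, url)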
Performance with the current configuration

With the current configuration the server can handle 10 requests per second. If the handler is changed to time.sleep(0.5), i.e. if each request responds in approximately 0.5 seconds, the server would be able to handle 20 requests per second: with a fixed pool of worker threads, the maximum throughput is roughly the number of workers divided by the time each request takes.
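As a back-of-the-envelope check of that arithmetic (the pool size of 10 workers and the 1-second handler are assumptions, inferred only from the 10 requests-per-second figure above):

# Rough throughput model for a fixed pool of worker threads.
# The numbers below are illustrative assumptions, not measurements.
def max_requests_per_second(workers, seconds_per_request):
    return workers / seconds_per_request

print(max_requests_per_second(10, 1.0))  # 10.0 -> matches the 10 req/s above
print(max_requests_per_second(10, 0.5))  # 20.0 -> matches the 20 req/s above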
Twistedless solution:
from urlparse import urlparse
from threading import Thread
import httplib, sys
from Queue import Queue

concurrent = 200

def doWork():
    # Each worker thread pulls URLs off the queue until the program exits.
    while True:
        url = q.get()
        status, url = getStatus(url)
        doSomethingWithResult(status, url)
        q.task_done()

def getStatus(ourl):
    # Send a HEAD request so we only fetch headers, not the body.
    try:
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(url.netloc)
        conn.request("HEAD", url.path)
        res = conn.getresponse()
        return res.status, ourl
    except:
        return "error", ourl

def doSomethingWithResult(status, url):
    print status, url

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
This one is slightly faster than the Twisted solution and uses less CPU.
Things have changed quite a bit since 2010, when this was posted. I haven't tried all the other answers, but I have tried a few, and I found the approach below to work best for me using Python 3.6.
I was able to fetch roughly 150 unique domains per second running on AWS.
import concurrent.futures
import requests
import time

out = []
CONNECTIONS = 100
TIMEOUT = 5

tlds = open('../data/sample_1k.txt').read().splitlines()
urls = ['http://{}'.format(x) for x in tlds[1:]]

def load_url(url, timeout):
    # HEAD request: we only need the status code, not the body.
    ans = requests.head(url, timeout=timeout)
    return ans.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    future_to_url = (executor.submit(load_url, url, TIMEOUT) for url in urls)
    time1 = time.time()
    # Process results as they complete, printing a running count.
    for future in concurrent.futures.as_completed(future_to_url):
        try:
            data = future.result()
        except Exception as exc:
            data = str(type(exc))
        finally:
            out.append(data)
            print(str(len(out)), end="\r")
    time2 = time.time()

print(f'Took {time2-time1:.2f} s')
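To apply the same pattern to the original question (100,000 URLs read from a file, printing each status code), a sketch along these lines should work. The filename urllist.txt and the one-URL-per-line format are carried over from the earlier answer; the pool size and timeout are guesses you would tune for your network.

import concurrent.futures
import requests

# Assumes urllist.txt holds one URL per line, as in the earlier answer.
with open('urllist.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

def fetch_status(url):
    try:
        return url, requests.head(url, timeout=5).status_code
    except requests.RequestException as exc:
        return url, type(exc).__name__

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    # executor.map yields results in input order; as_completed (above) yields
    # them as they finish. Either is fine for simply printing status codes.
    for url, status in executor.map(fetch_status, urls):
        print(status, url)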