What is the fastest way to send 100,000 HTTP requests in Python?

I am opening a file which has 100,000 URLs. I need to send an HTTP request to each URL and print the status code. I am using Python 2.6, and so far I have looked at the many confusing ways Python implements threading/concurrency. I have even looked at the Python concurrence library, but cannot figure out how to write this program correctly. Has anyone come across a similar problem? I guess generally I need to know how to perform thousands of tasks in Python as fast as possible - I suppose that means 'concurrently'.

asked Apr 13 '10 by IgorGanapolsky


2 Answers

Twisted-less solution:

from urlparse import urlparse
from threading import Thread
import httplib, sys
from Queue import Queue

concurrent = 200

def doWork():
    while True:
        url = q.get()
        status, url = getStatus(url)
        doSomethingWithResult(status, url)
        q.task_done()

def getStatus(ourl):
    try:
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(url.netloc)
        conn.request("HEAD", url.path)
        res = conn.getresponse()
        return res.status, ourl
    except:
        return "error", ourl

def doSomethingWithResult(status, url):
    print status, url

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)

This one is slightly faster than the Twisted solution and uses less CPU.
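The code above targets Python 2.6. On Python 3 the same queue-and-threads approach still works; mainly the module names change (urlparse becomes urllib.parse, httplib becomes http.client, Queue becomes queue). A rough sketch of that port, keeping the urllist.txt file name from the original and adding a "/" fallback for URLs that have no explicit path:

from urllib.parse import urlparse
from threading import Thread
import http.client
import sys
from queue import Queue

concurrent = 200  # number of worker threads

def do_work():
    while True:
        url = q.get()
        status, url = get_status(url)
        do_something_with_result(status, url)
        q.task_done()

def get_status(ourl):
    try:
        url = urlparse(ourl)
        conn = http.client.HTTPConnection(url.netloc)
        # fall back to "/" for URLs without an explicit path
        conn.request("HEAD", url.path or "/")
        res = conn.getresponse()
        return res.status, ourl
    except Exception:
        return "error", ourl

def do_something_with_result(status, url):
    print(status, url)

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=do_work, daemon=True)
    t.start()

try:
    for url in open('urllist.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)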

answered Oct 03 '22 by Tarnay Kálmán

Things have changed quite a bit since 2010, when this was posted. I haven't tried all the other answers, but I have tried a few, and this worked best for me on Python 3.6.

I was able to fetch about 150 unique domains per second running on AWS.

import concurrent.futures
import requests
import time

out = []
CONNECTIONS = 100
TIMEOUT = 5

tlds = open('../data/sample_1k.txt').read().splitlines()
urls = ['http://{}'.format(x) for x in tlds[1:]]

def load_url(url, timeout):
    ans = requests.head(url, timeout=timeout)
    return ans.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    future_to_url = (executor.submit(load_url, url, TIMEOUT) for url in urls)
    time1 = time.time()
    for future in concurrent.futures.as_completed(future_to_url):
        try:
            data = future.result()
        except Exception as exc:
            data = str(type(exc))
        finally:
            out.append(data)
            print(str(len(out)), end="\r")

    time2 = time.time()

print(f'Took {time2-time1:.2f} s')
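One caveat with the snippet above, not from the original answer: because future_to_url is a bare generator, a failed request cannot be traced back to the URL that caused it. A small variant of the same requests plus ThreadPoolExecutor approach that maps each future to its URL (the urls.txt file name is a placeholder) might look like this:

import concurrent.futures
import requests

CONNECTIONS = 100
TIMEOUT = 5

# placeholder input file: one URL per line
urls = [line.strip() for line in open('urls.txt') if line.strip()]

def load_url(url, timeout):
    return requests.head(url, timeout=timeout).status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    # a dict (rather than a generator) lets each result or error be attributed to its URL
    future_to_url = {executor.submit(load_url, url, TIMEOUT): url for url in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            print(future.result(), url)
        except requests.RequestException as exc:
            print('error', url, type(exc).__name__)

Holding all futures in a dict keeps every submitted task in memory at once, which is fine for 100,000 URLs but worth keeping in mind for much larger lists.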
answered Oct 03 '22 by Glen Thompson