Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python requests with multithreading

I've been trying to build a scraper with multithreading functionality past two days. Somehow I still couldn't manage it. At first I tried regular multithreading approach with threading module but it wasn't faster than using a single thread. Later I learnt that requests is blocking and multithreading approach isn't really working. So I kept researching and found out about grequests and gevent. Now I'm running tests with gevent and it's still not faster than using a single thread. Is my coding wrong?

Here is the relevant part of my class:

import gevent.monkey
from gevent.pool import Pool
import requests

gevent.monkey.patch_all()

class Test:
    def __init__(self):
        self.session = requests.Session()
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):

        try:
            response = self.session.get(url, headers=self.headers)
        except:
            self.logger.error('Problem: ', id, exc_info=True)

        self.doSomething(response)

    def async(self):
        for url in self.urls:
            self.pool.spawn( self.fetch, url )

        self.pool.join()

test = Test()
test.async()
like image 207
krypt Avatar asked Jul 09 '16 08:07

krypt


People also ask

Are Python requests multithreaded?

Multithreading in PythonBy default, your Python programs have a single thread, called the main thread. You can create threads by passing a function to the Thread() constructor or by inheriting the Thread class and overriding the run() method.

Can you multithread API calls?

Yes, you can do this. Even writing data shouldn't be a problem. The only risks is that you could overload your data connection(s) a lot faster or that your server may not allow this. The other risks are inherent to doing multithreading badly.

Does multithreading in Python improve performance?

This is why Python multithreading can provide a large speed increase. The processor can switch between the threads whenever one of them is ready to do some work. Using the threading module in Python or any other interpreted language with a GIL can actually result in reduced performance.


1 Answers

Install the grequests module which works with gevent (requests is not designed for async):

pip install grequests

Then change the code to something like this:

import grequests

class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com', 
            'http://www.yahoo.com',
            'http://www.stackoverflow.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        print "Problem: {}: {}".format(request.url, exception)

    def async(self):
        results = grequests.map((grequests.get(u) for u in self.urls), exception_handler=self.exception, size=5)
        print results

test = Test()
test.async()

This is officially recommended by the requests project:

Blocking Or Non-Blocking?

With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.

Using this method gives me a noticable performance increase with 10 URLs: 0.877s vs 3.852s with your original method.

like image 97
Will Avatar answered Oct 13 '22 03:10

Will