I've been trying to build a scraper with multithreading functionality for the past two days, and somehow I still can't manage it. At first I tried the regular multithreading approach with the threading module, but it wasn't faster than using a single thread. Later I learned that requests is blocking and the multithreading approach wasn't really working. So I kept researching and found out about grequests and gevent. Now I'm running tests with gevent and it's still not faster than using a single thread. Is my coding wrong?
Here is the relevant part of my class:
import gevent.monkey
from gevent.pool import Pool
import requests

gevent.monkey.patch_all()

class Test:
    def __init__(self):
        self.session = requests.Session()
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):
        try:
            response = self.session.get(url, headers=self.headers)
        except:
            self.logger.error('Problem: ', id, exc_info=True)
        self.doSomething(response)

    def async(self):
        for url in self.urls:
            self.pool.spawn(self.fetch, url)
        self.pool.join()

test = Test()
test.async()
Install the grequests module, which works with gevent (requests is not designed for async):

pip install grequests
Then change the code to something like this:
import grequests

class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com',
            'http://www.yahoo.com',
            'http://www.stackoverflow.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        print "Problem: {}: {}".format(request.url, exception)

    def async(self):
        results = grequests.map((grequests.get(u) for u in self.urls),
                                exception_handler=self.exception, size=5)
        print results

test = Test()
test.async()
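grequests.map sends all of the requests concurrently on gevent greenlets; the size argument caps how many are in flight at once (much like the Pool(20) in your original code), and exception_handler is called for any request that fails instead of the whole batch raising.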
This is officially recommended by the requests project:

Blocking Or Non-Blocking?

With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
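For comparison, requests-futures takes the thread-pool route instead, wrapping a Requests session in concurrent.futures. A minimal sketch (the URL here is only a placeholder):

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=5)  # requests run on a background thread pool
future = session.get('http://www.example.com')  # returns immediately with a Future
response = future.result()  # blocks until this one response is ready
print response.status_code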
Using this method gives me a noticeable performance increase with 10 URLs: 0.877s vs 3.852s with your original method.
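If you want to reproduce that comparison, here is a minimal sketch of one way to time it; the URL list is a placeholder (use your own 10 URLs), and the absolute numbers will depend on your network:

import time
import requests
import grequests

urls = ['http://www.example.com'] * 10  # placeholder list of 10 URLs

# sequential baseline: one blocking request after another
start = time.time()
for u in urls:
    requests.get(u)
print "sequential: {:.3f}s".format(time.time() - start)

# concurrent: grequests runs the requests on gevent greenlets
start = time.time()
grequests.map(grequests.get(u) for u in urls)
print "grequests: {:.3f}s".format(time.time() - start)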