I've been trying to build a scraper with multithreading functionality for the past two days, and somehow I still can't manage it. At first I tried the regular multithreading approach with the threading module, but it wasn't faster than using a single thread. Later I learned that requests is blocking and the multithreading approach wasn't really working. So I kept researching and found out about grequests and gevent. Now I'm running tests with gevent and it's still not faster than using a single thread. Is my coding wrong?
Here is the relevant part of my class:
import gevent.monkey
from gevent.pool import Pool
import requests

gevent.monkey.patch_all()

class Test:
    def __init__(self):
        self.session = requests.Session()
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):
        try:
            response = self.session.get(url, headers=self.headers)
        except:
            self.logger.error('Problem: ', id, exc_info=True)
        self.doSomething(response)

    def async(self):
        for url in self.urls:
            self.pool.spawn(self.fetch, url)
        self.pool.join()

test = Test()
test.async()
Install the grequests module, which works with gevent (requests is not designed for async):

pip install grequests
Then change the code to something like this:
import grequests

class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com',
            'http://www.yahoo.com',
            'http://www.stackoverflow.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        print "Problem: {}: {}".format(request.url, exception)

    def async(self):
        results = grequests.map((grequests.get(u) for u in self.urls),
                                exception_handler=self.exception, size=5)
        print results

test = Test()
test.async()
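grequests.map sends all of the requests concurrently on gevent greenlets; the size argument caps how many are in flight at once (much like the Pool(20) in your original code), and exception_handler is called for any request that fails instead of the whole batch raising.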
This is officially recommended by the requests project:

Blocking Or Non-Blocking?

With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
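For comparison, requests-futures takes the thread-pool route instead, wrapping a Requests session in concurrent.futures. A minimal sketch (the URL here is only a placeholder):

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=5)  # requests run on a background thread pool
future = session.get('http://www.example.com')  # returns immediately with a Future
response = future.result()  # blocks until this one response is ready
print response.status_code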
Using this method gives me a noticeable performance increase with 10 URLs: 0.877s vs 3.852s with your original method.
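If you want to reproduce that comparison, here is a minimal sketch of one way to time it; the URL list is a placeholder (use your own 10 URLs), and the absolute numbers will depend on your network:

import time
import requests
import grequests

urls = ['http://www.example.com'] * 10  # placeholder list of 10 URLs

# sequential baseline: one blocking request after another
start = time.time()
for u in urls:
    requests.get(u)
print "sequential: {:.3f}s".format(time.time() - start)

# concurrent: grequests runs the requests on gevent greenlets
start = time.time()
grequests.map(grequests.get(u) for u in urls)
print "grequests: {:.3f}s".format(time.time() - start)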