I have to make a large number (thousands) of HTTP GET requests to a great many websites. This is pretty slow, because some websites may not respond (or take a long time to do so), while others time out. Since I need as many responses as I can get, setting a small timeout (3-5 seconds) is not in my favour.
I have yet to do any kind of multiprocessing or multi-threading in Python, and I've been reading the documentation for a good while. Here's what I have so far:
import requests
from bs4 import BeautifulSoup
from multiprocessing import Process, Pool

errors = 0

def get_site_content(site):
    try:
        # start = time.time()
        response = requests.get(site, allow_redirects=True, timeout=5)
        response.raise_for_status()
        content = response.text
    except Exception as e:
        global errors
        errors += 1
        return ''

    soup = BeautifulSoup(content)
    for script in soup(["script", "style"]):
        script.extract()

    text = soup.get_text()
    return text

sites = ["http://www.example.net", ...]

pool = Pool(processes=5)
results = pool.map(get_site_content, sites)
print results
Now, I want the results that are returned to be joined somehow. This allows two variations:
1. Each process has a local list/queue containing the content it has accumulated, which is then joined with the other queues to form a single result containing all the content for all sites.
2. Each process writes to a single global queue as it goes along. This would entail some locking mechanism for concurrency checks, roughly as sketched below.
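For the second variation, I imagine something roughly like the following (an untested sketch; as I understand it, multiprocessing.Queue handles the locking internally, and with thousands of sites I would presumably need a fixed pool of workers rather than one process per site as shown here):

from multiprocessing import Process, Queue

def fetch_into_queue(site, result_queue):
    # reuse get_site_content() from above and push the result onto the shared queue
    result_queue.put(get_site_content(site))

result_queue = Queue()
processes = [Process(target=fetch_into_queue, args=(site, result_queue))
             for site in sites]
for p in processes:
    p.start()

# drain one result per site before joining, so the workers can flush their queues
results = [result_queue.get() for _ in processes]
for p in processes:
    p.join()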
Would multiprocessing or multithreading be the better choice here? How would I accomplish the above with either of the approaches in Python?
Edit:
I did attempt something like the following:
# global
queue = []

with Pool(processes=5) as pool:
    queue.append(pool.map(get_site_content, sites))

print queue
However, this gives me the following error:
with Pool(processes = 4) as pool:
AttributeError: __exit__
I don't quite understand this error. I'm also having a little trouble understanding what exactly pool.map does, beyond applying the function to every object in the iterable second parameter. Does it return anything? If not, do I append to the global queue from within the function?
pool.map hands the items of the iterable to a pool of n worker processes and runs the given function on each of them. When a worker finishes an item and returns, the returned value is stored in the result list at the same position as the corresponding input item.
For example, if a function is written to calculate the square of a number, pool.map can be used to run this function on a list of numbers:
from multiprocessing import Pool

def square_this(x):
    square = x**2
    return square

input_iterable = [2, 3, 4]
pool = Pool(processes=2)                        # initialize a pool of 2 processes
result = pool.map(square_this, input_iterable)  # use the pool to run the function on each item in the iterable
pool.close()                                    # this means that no more tasks will be added to the pool
pool.join()                                     # this blocks the program till the function has run on all the items

# print the result
print result
>> [4, 9, 16]
The Pool.map technique may not be ideal in your case, since it blocks until all the processes finish; i.e. if a website does not respond or takes too long to respond, your program will be stuck waiting for it. Instead, try sub-classing multiprocessing.Process in your own class which polls these websites, and use Queues to access the results. When you have a satisfactory number of responses you can stop all the processes without having to wait for the remaining requests to finish.
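A rough, untested sketch of that idea (the names SiteFetcher and wanted are just illustrative, and I have left out the BeautifulSoup clean-up for brevity):

from multiprocessing import Process, Queue
import requests

class SiteFetcher(Process):
    """Worker that pulls URLs off a task queue and pushes (site, text) pairs onto a result queue."""

    def __init__(self, task_queue, result_queue):
        Process.__init__(self)
        self.task_queue = task_queue
        self.result_queue = result_queue

    def run(self):
        for site in iter(self.task_queue.get, None):  # None is the shutdown sentinel
            try:
                response = requests.get(site, allow_redirects=True, timeout=5)
                response.raise_for_status()
                self.result_queue.put((site, response.text))
            except Exception:
                self.result_queue.put((site, ''))  # still report the site so the count stays in sync

sites = ["http://www.example.net"]  # your full list of sites
wanted = 1000                       # stop once this many responses have arrived

task_queue, result_queue = Queue(), Queue()
for site in sites:
    task_queue.put(site)

fetchers = [SiteFetcher(task_queue, result_queue) for _ in range(5)]
for f in fetchers:
    task_queue.put(None)  # one sentinel per worker so each one eventually exits
    f.start()

results = []
while len(results) < min(wanted, len(sites)):
    results.append(result_queue.get())

# enough responses collected; terminate the remaining workers instead of waiting for them
for f in fetchers:
    f.terminate()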