
Python + requests + splinter: What's the fastest/best way to make multiple concurrent 'get' requests?

I'm currently taking a web scraping class with other students, and we are supposed to make 'get' requests to a dummy site, parse it, and then visit another site.

The problem is that the dummy site's content is only up for a few minutes at a time, then disappears and reappears at a certain interval. While the content is available, everyone tries to make their 'get' requests at once, so mine just hangs until the load clears, and by then the content has disappeared again. So I end up unable to complete the 'get' request successfully:

import requests
from splinter import Browser    

browser = Browser('chrome')

# Hangs here
html = requests.get('http://dummysite.ca').text
# Even if the get succeeds, this hangs as well
browser.visit(parsed_url)

So my question is: what's the fastest/best way to keep making concurrent 'get' requests until I get a response?

asked May 05 '17 by Jo Ko

2 Answers

  1. Decide whether to use requests or splinter

    Read about Requests: HTTP for Humans
    Read about Splinter

  2. Related

    Read about keep-alive
    Read about blocking-or-non-blocking
    Read about timeouts
    Read about errors-and-exceptions (see the sketch after this list)
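
As a minimal sketch of those points (assuming the dummy URL from the question): a requests.Session reuses the underlying TCP connection (keep-alive), the timeout parameter bounds how long a call may block, and requests.exceptions.RequestException is the base class for the library's errors:

import requests

# A Session reuses TCP connections (keep-alive) across requests.
session = requests.Session()

try:
    # timeout bounds blocking: (connect timeout, read timeout) in seconds.
    response = session.get('http://dummysite.ca', timeout=(3, 10))
    response.raise_for_status()  # turn HTTP 4xx/5xx into exceptions
    print(response.text[:100])
except requests.exceptions.RequestException as exc:
    # Base class for everything requests can raise.
    print('Request failed:', exc)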

If you can get requests that do not hang (for instance by setting a timeout), you can retry in a loop until one succeeds, for instance:

import time
import requests

while True:
    try:
        response = requests.get('http://dummysite.ca', timeout=5)
        if response.ok:  # HTTP status < 400
            break
    except requests.exceptions.RequestException:
        pass  # failed attempt; retry after a pause
    time.sleep(1)
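
If many students are retrying at once, it may also help to randomize or gradually increase the sleep interval (a simple backoff) so the retries do not all land at the same instant.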
answered Sep 28 '22 by stovfl


Gevent provides a framework for running asynchronous network requests.

It can monkey-patch Python's standard library so that existing libraries like requests and splinter run cooperatively out of the box.

Here is a short example of how to make 10 concurrent requests, based on the above code, and read their responses.

from gevent import monkey
monkey.patch_all()
import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]
# Wait for all requests to complete
pool.join()
for greenlet in greenlets:
    # get() re-raises any exception raised inside the greenlet,
    # so either catch errors here or inspect `greenlet.exception`
    # before calling it.
    response = greenlet.get()
    text_response = response.text

You could also use the pool's map method with a response-handling function instead of calling get on each greenlet.
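
A sketch of that approach, reusing the pool and URL from above; fetch is a hypothetical helper name, not part of gevent:

def fetch(url):
    # Handle errors inside the worker so one failed
    # request does not abort the whole batch.
    try:
        return requests.get(url, timeout=10).text
    except requests.exceptions.RequestException:
        return None

# map blocks until every greenlet finishes and returns
# results in the same order as the input URLs.
texts = pool.map(fetch, ['http://dummysite.ca'] * 10)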

See the gevent documentation for more information.

answered Sep 28 '22 by danny