Wait for a Request to complete - Python Scrapy

I have a Scrapy Spider which scrapes a website, and that website requires refreshing a token to be able to access it.

def get_ad(self, response):
    temp_dict = AppextItem()
    try:
        # Raises IndexError when no CAPTCHA message box is present on the page
        Selector(response).xpath('//div[@class="messagebox"]').extract()[0]
        print("Captcha found when scraping ID " + response.meta['id'] + " LINK: " + response.meta['link'])
        self.p_token = ''

        return Request(url=url_, callback=self.get_p_token, method="GET", priority=1, meta=response.meta)

    except IndexError:
        print("Captcha was not found")

I have a get_p_token method that refreshes the token and assigns it to self.p_token.

get_p_token is called when a Captcha is found, but the problem is that the other Requests keep executing.

I want Scrapy to make no further requests once a Captcha is found, until the execution of get_p_token has finished.

I have priority=1 but that does not help.

Here is the full code of the Spider.

P.S.:

That token is passed with each URL, which is why I want to wait until a new token is found and only then scrape the rest of the URLs.

Umair Ayub asked Oct 03 '16

People also ask

How do you delay requests in Scrapy?

If you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY = 1 is the way to do it. Scrapy also has a feature called AutoThrottle that sets download delays automatically, based on the load of both the Scrapy server and the website you are crawling.
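In settings.py, the two options mentioned above look like this (the numeric values are illustrative, not recommendations):

```python
# settings.py -- illustrative values only

# Fixed delay: wait exactly 1 second between requests to the same site.
DOWNLOAD_DELAY = 1

# Or let AutoThrottle adjust the delay dynamically based on observed load.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1   # initial delay
AUTOTHROTTLE_MAX_DELAY = 10    # upper bound when the server is slow
```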

How do you get a response from a Scrapy request?

You can use the FormRequest.from_response() method for this job, for example to submit a login form and process the result in a callback.

How do you use Scrapy Requests?

Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.

How do you make a Scrapy request?

Making a request is straightforward in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract data, and a callback function, which is invoked when a response to the request arrives.


1 Answer

You should implement your CAPTCHA solving logic as a middleware. See captcha-middleware for inspiration.

The middleware should take care of assigning the right token to requests (from process_request()) and detect CAPTCHA prompts (from process_response()).

Within the middleware, you can use something other than Scrapy (e.g. the requests library) to perform the requests needed for CAPTCHA solving synchronously, which prevents new Scrapy requests from starting until it is done.

Of course, any parallel requests that had already been triggered will still go out, so it is technically possible for a few requests to be sent without an updated token. However, those should be retried automatically. You can have your middleware update the tokens of those requests upon retrying by making sure it works nicely with the retry middleware.
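A minimal sketch of the middleware approach described above (not the actual captcha-middleware project): TOKEN_URL is a hypothetical endpoint, and the "messagebox" XPath is taken from the question's CAPTCHA check.

```python
# Hedged sketch of a downloader middleware that refreshes a token
# synchronously. TOKEN_URL is hypothetical; adapt it to the real site.
import requests  # used instead of Scrapy so the token refresh blocks

TOKEN_URL = "https://example.com/get-token"  # hypothetical endpoint


class TokenRefreshMiddleware:
    """Attach the current token to every request and refresh it
    synchronously when a CAPTCHA page is detected."""

    def __init__(self):
        self.token = None

    def process_request(self, request, spider):
        # Inject the latest token into each outgoing request.
        if self.token:
            request.meta["p_token"] = self.token
        return None  # continue normal processing

    def process_response(self, request, response, spider):
        # Detect the CAPTCHA page (same XPath as in the question).
        if response.xpath('//div[@class="messagebox"]'):
            self._refresh_token()
            # Re-schedule the same request so it runs with the new token;
            # dont_filter bypasses the duplicate filter.
            return request.replace(dont_filter=True)
        return response

    def _refresh_token(self):
        # Blocking call: Scrapy starts no new downloads while this runs,
        # which provides the "wait until refreshed" behavior asked for.
        self.token = requests.get(TOKEN_URL).text.strip()
```

Enable it via DOWNLOADER_MIDDLEWARES in settings.py, pointing at wherever you place the class (the module path and priority number are placeholders you choose for your project).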

Gallaecio answered Sep 21 '22