I have a Scrapy Spider that scrapes a website, and the website requires refreshing a token before its pages can be accessed.
def get_ad(self, response):
    temp_dict = AppextItem()
    try:
        # If this XPath matches, the page is a CAPTCHA prompt;
        # extract()[0] raises IndexError when there is no match.
        Selector(response).xpath('//div[@class="messagebox"]').extract()[0]
        print("Captcha found when scraping ID " + response.meta['id'] + " LINK: " + response.meta['link'])
        self.p_token = ''
        return Request(url=url_, callback=self.get_p_token, method="GET", priority=1, meta=response.meta)
    except IndexError:
        print("Captcha was not found")
I have a get_p_token method that refreshes the token and assigns it to self.p_token. get_p_token is called when a CAPTCHA is found, but the problem is that the other Requests keep executing. I want that, when a CAPTCHA is found, no further requests are made until get_p_token has finished. I set priority=1 on the request, but that does not help.
HERE is the full code of the Spider.
P.S.: That token is passed to each URL, which is why I want to wait until a new token is found before scraping the rest of the URLs.
You should implement your CAPTCHA solving logic as a middleware. See captcha-middleware for inspiration.
The middleware should take care of assigning the right token to requests (from process_request()) and of detecting CAPTCHA prompts (from process_response()).
Within the middleware, you can use something other than Scrapy (e.g. requests) to perform the requests needed for CAPTCHA solving in a synchronous way that prevents new requests from starting until done.
Of course, any parallel requests already triggered will have been sent, so it is technically possible for a few requests to go out without an updated token. However, those should be retried automatically. You can configure your middleware to update the tokens of those requests upon retry by making sure it works nicely with the retry middleware.