Wait for a Request to complete - Python Scrapy

I have a Scrapy Spider which scrapes a website, and that website requires refreshing a token to be able to access it.

def get_ad(self, response):
    temp_dict = AppextItem()
    try:
        # Raises IndexError when no CAPTCHA message box is present on the page
        Selector(response).xpath('//div[@class="messagebox"]').extract()[0]
        print("Captcha found when scraping ID " + response.meta['id'] + " LINK: " + response.meta['link'])
        self.p_token = ''

        return Request(url=url_, callback=self.get_p_token, method="GET", priority=1, meta=response.meta)

    except IndexError:
        print("Captcha was not found")

I have a get_p_token method that refreshes the token and assigns it to self.p_token.

get_p_token is called when a Captcha is found, but the problem is that the other Requests keep executing.

I want Scrapy to make no further requests once a Captcha is found, until the execution of get_p_token has finished.

I have priority=1 but that does not help.

Here is the full code of the Spider.

P.S.:

That token is passed with each URL, which is why I want to wait until a new token is found and only then scrape the rest of the URLs.

Umair Ayub asked Oct 03 '16

People also ask

How do you delay requests in Scrapy?

If you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY = 1 is the way to do it. Scrapy also has a feature called AutoThrottle that sets download delays automatically, based on the load of both the Scrapy server and the website you are crawling.
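In settings.py, the two options mentioned above look like this (the numeric values are illustrative, not recommendations):

```python
# settings.py -- illustrative values only

# Fixed delay: wait exactly 1 second between requests to the same site.
DOWNLOAD_DELAY = 1

# Or let AutoThrottle adjust the delay dynamically based on observed load.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1   # initial delay
AUTOTHROTTLE_MAX_DELAY = 10    # upper bound when the server is slow
```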

How do you get a response from a Scrapy request?

You can use the FormRequest.from_response() method for this job, for example to submit a login form and process the result in a callback.

How do you use Scrapy Requests?

Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in spiders and passed across the system until they reach the Downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.

How do you make a Scrapy request?

Making a request is straightforward in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract data, and a callback function, which is invoked when a response to the request arrives.


1 Answer

You should implement your CAPTCHA solving logic as a middleware. See captcha-middleware for inspiration.

The middleware should take care of assigning the right token to requests (from process_request()) and detect CAPTCHA prompts (from process_response()).

Within the middleware, you can use something other than Scrapy (e.g. the requests library) to perform the requests needed for CAPTCHA solving synchronously, which prevents new Scrapy requests from starting until it is done.

Of course, any parallel requests that had already been triggered will still go out, so it is technically possible for a few requests to be sent without an updated token. However, those should be retried automatically. You can have your middleware update the tokens of those requests upon retrying by making sure it works nicely with the retry middleware.
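A minimal sketch of the middleware approach described above (not the actual captcha-middleware project): TOKEN_URL is a hypothetical endpoint, and the "messagebox" XPath is taken from the question's CAPTCHA check.

```python
# Hedged sketch of a downloader middleware that refreshes a token
# synchronously. TOKEN_URL is hypothetical; adapt it to the real site.
import requests  # used instead of Scrapy so the token refresh blocks

TOKEN_URL = "https://example.com/get-token"  # hypothetical endpoint


class TokenRefreshMiddleware:
    """Attach the current token to every request and refresh it
    synchronously when a CAPTCHA page is detected."""

    def __init__(self):
        self.token = None

    def process_request(self, request, spider):
        # Inject the latest token into each outgoing request.
        if self.token:
            request.meta["p_token"] = self.token
        return None  # continue normal processing

    def process_response(self, request, response, spider):
        # Detect the CAPTCHA page (same XPath as in the question).
        if response.xpath('//div[@class="messagebox"]'):
            self._refresh_token()
            # Re-schedule the same request so it runs with the new token;
            # dont_filter bypasses the duplicate filter.
            return request.replace(dont_filter=True)
        return response

    def _refresh_token(self):
        # Blocking call: Scrapy starts no new downloads while this runs,
        # which provides the "wait until refreshed" behavior asked for.
        self.token = requests.get(TOKEN_URL).text.strip()
```

Enable it via DOWNLOADER_MIDDLEWARES in settings.py, pointing at wherever you place the class (the module path and priority number are placeholders you choose for your project).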

Gallaecio answered Sep 21 '22