How do I set up Scrapy to deal with a captcha?

I'm trying to scrape a site that requires the user to enter the search value and a captcha. I've got an optical character recognition (OCR) routine for the captcha that succeeds about 33% of the time. Since the captchas are always alphabetic text, I want to reload the captcha if the OCR function returns non-alphabetic characters. Once I have a text "word", I want to submit the search form.

The results come back in the same page, with the form ready for a new search and a new captcha. So I need to rinse and repeat until I've exhausted my search terms.

Here's the top-level algorithm:

1. Load the page initially.
2. Download the captcha image and run it through the OCR.
3. If the OCR doesn't come back with a text-only result, refresh the captcha and repeat this step.
4. Submit the query form in the page with the search term and captcha.
5. Check the response to see whether the captcha was correct.
6. If it was correct, scrape the data.
7. Go to step 2.
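
The steps above can be sketched in plain Python, independent of Scrapy. Everything here is hypothetical: `load_captcha`, `ocr`, and `submit` stand in for site-specific code and are passed in as callables, and `max_attempts` is an added safety cap that is not part of the algorithm as listed.

```python
def run_searches(terms, load_captcha, ocr, submit, max_attempts=10):
    """Run every search term through the captcha-protected form.

    load_captcha()     -> image bytes of a fresh captcha (hypothetical)
    ocr(image)         -> guessed text, possibly garbage (hypothetical)
    submit(term, word) -> scraped data, or None if the captcha was rejected
    """
    results = {}
    for term in terms:
        attempts = 0
        while term not in results and attempts < max_attempts:
            word = ocr(load_captcha())
            attempts += 1
            # Captchas are alphabetic, so anything else is an OCR miss:
            # refresh the captcha instead of submitting a bad guess.
            if not word or not word.isalpha():
                continue
            data = submit(term, word)
            if data is not None:  # captcha accepted, keep the data
                results[term] = data
    return results
```

A term that never gets an accepted captcha within `max_attempts` is simply missing from the returned dict, so the caller can decide whether to try it again later.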

I've tried using a pipeline for getting the captcha, but then I don't have the value for the form submission. If I just fetch the image without going through the framework, using urllib or something, then the cookie with the session is not submitted, so the captcha validation on the server fails.

What's the ideal Scrapy way of doing this?

asked Aug 25 '16 05:08

Sushil

1 Answer

It's a really deep topic with a bunch of solutions. But if you want to apply the logic you've described in your post, you can use a Scrapy downloader middleware.

Something like:

import logging

from scrapy.exceptions import IgnoreRequest


class CaptchaMiddleware(object):
    max_retries = 5

    def process_response(self, request, response, spider):
        if not request.meta.get('solve_captcha', False):
            return response  # only handle requests marked with the meta key
        # find_captcha() and solve_captcha() are placeholders for your own code
        captcha = find_captcha(response)
        if not captcha:  # the page might not have a captcha at all!
            return response
        solved = solve_captcha(captcha)
        if solved:
            # store the results on request.meta; the callback reads them back
            # as response.meta (the response may not be tied to its request yet)
            request.meta['captcha'] = captcha
            request.meta['solved_captcha'] = solved
            return response
        # solving failed: retry the page to get a new captcha,
        # but cap the retries to prevent an endless loop
        retries = request.meta.get('captcha_retries', 0)
        if retries >= self.max_retries:
            logging.warning('max retries for captcha reached for {}'.format(request.url))
            raise IgnoreRequest
        # dont_filter is a Request attribute, not a meta key
        retry = request.replace(dont_filter=True)
        retry.meta['captcha_retries'] = retries + 1
        return retry

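This middleware intercepts the response of every request marked with the solve_captcha meta key and tries to solve the captcha. If solving fails, it retries the page to get a new captcha; if it succeeds, it adds meta keys holding the captcha and its solved value, which the callback can read from response.meta.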
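
For the middleware to run at all, it has to be enabled in the project settings. A minimal fragment, assuming the class lives in a hypothetical `myproject/middlewares.py`; the priority `543` is an arbitrary choice and may need adjusting relative to the other middlewares in your project:

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # the module path is an assumption; point it at wherever
    # CaptchaMiddleware is actually defined
    'myproject.middlewares.CaptchaMiddleware': 543,
}
```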
In your spider you would use it like this:

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        url = ''  # url that requires captcha
        yield scrapy.Request(url, callback=self.parse_captchad,
                             meta={'solve_captcha': True},
                             errback=self.parse_fail)

    def parse_captchad(self, response):
        solved = response.meta['solved_captcha']
        # do stuff

    def parse_fail(self, failure):
        # failed to retrieve captcha in 5 tries :(
        # do stuff
        pass
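
`find_captcha` and `solve_captcha` are left undefined above because they depend on the site and on your OCR tool. As a rough sketch of the second one, assuming a hypothetical `ocr` callable that takes image bytes and returns a string: it returns the word only when the result is purely alphabetic and `None` otherwise, so the middleware's `if solved:` check falls through to a retry on a bad read.

```python
def solve_captcha(image_bytes, ocr):
    """Return the OCR result if it looks like a valid captcha word, else None.

    `ocr` is a stand-in for whatever OCR routine you use (an assumption,
    not a Scrapy API): it takes image bytes and returns a string.
    """
    text = (ocr(image_bytes) or '').strip()
    # The question says captchas are always alphabetic, so reject
    # digits, punctuation and empty results.
    return text if text.isalpha() else None
```

Since the middleware calls `solve_captcha(captcha)` with a single argument, you could bind your OCR function up front, for example with `functools.partial(solve_captcha, ocr=my_ocr)`, where `my_ocr` is your own routine.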

answered Oct 01 '22 06:10

Granitosaurus

Tags: python, web-scraping, scrapy, captcha
