I'm trying to scrape a site that requires the user to enter the search value and a captcha. I've got an optical character recognition (OCR) routine for the captcha that succeeds about 33% of the time. Since the captchas are always alphabetic text, I want to reload the captcha if the OCR function returns non-alphabetic characters. Once I have a text "word", I want to submit the search form.
The results come back in the same page, with the form ready for a new search and a new captcha. So I need to rinse and repeat until I've exhausted my search terms.
Here's the top-level algorithm:
I've tried using a pipeline for getting the captcha, but then I don't have the value for the form submission. If I just fetch the image without going through the framework, using urllib or something, then the cookie with the session is not submitted, so the captcha validation on the server fails.
What's the ideal Scrapy way of doing this?
While automating Captcha is not the best practice, there are three efficient ways of handling Captcha in Selenium: By disabling the Captcha in the testing environment. Adding a hook to click the Captcha checkbox. By adding a delay to the Webdriver and manually solve Captcha while testing.
Simple CAPTCHAs can be bypassed using the Optical Character Recognition (OCR) technology that recognizes the text inside images, such as scanned documents and photographs. This technology converts images containing written text into machine-readable text data.
Unfortunately, you cannot disable CAPTCHA, otherwise, you will be receiving a lot of spam via the form.
It's a really deep topic with a bunch of solutions. But if you want to apply the logic you've defined in your post you can use scrapy Downloader Middlewares.
Something like:
class CaptchaMiddleware(object):
max_retries = 5
def process_response(request, response, spider):
if not request.meta.get('solve_captcha', False):
return response # only solve requests that are marked with meta key
catpcha = find_catpcha(response)
if not captcha: # it might not have captcha at all!
return response
solved = solve_captcha(captcha)
if solved:
response.meta['catpcha'] = captcha
response.meta['solved_catpcha'] = solved
return response
else:
# retry page for new captcha
# prevent endless loop
if request.meta.get('catpcha_retries', 0) == max_retries:
logging.warning('max retries for captcha reached for {}'.format(request.url))
raise IgnoreRequest
request.meta['dont_filter'] = True
request.meta['captcha_retries'] = request.meta.get('captcha_retries', 0) + 1
return request
This example will intercept every response and try to solve the captcha. If failed it will retry the page for new captcha, if successful it will add some meta keys to response with solved captcha values.
In your spider you would use it like this:
class MySpider(scrapy.Spider):
def parse(self, response):
url = ''# url that requires captcha
yield Request(url, callback=self.parse_captchad, meta={'solve_captcha': True},
errback=self.parse_fail)
def parse_captchad(self, response):
solved = response['solved']
# do stuff
def parse_fail(self, response):
# failed to retrieve captcha in 5 tries :(
# do stuff
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With