Captchas in Scrapy

I'm working on a Scrapy app where I'm trying to log in to a site with a form that uses a captcha (it's not spam). I am using ImagesPipeline to download the captcha and printing it to the screen for the user to solve. So far so good.

My question is: how can I restart the spider to submit the captcha/form information? Right now my spider requests the captcha page, then returns an Item containing the image_url of the captcha, which is processed and downloaded by ImagesPipeline and displayed to the user. What's unclear is how I can resume the spider's progress and pass it the solved captcha and the same session, since I believe the spider has to return the item (i.e. finish) before ImagesPipeline goes to work. Roughly, the current flow looks like this (class names, URLs, and the selector are placeholders; ImagesPipeline has IMAGES_STORE configured in settings):
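
import scrapy

class CaptchaItem(scrapy.Item):
    # image_urls/images are the field names ImagesPipeline expects by default
    image_urls = scrapy.Field()
    images = scrapy.Field()

class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["https://example.com/login"]   # placeholder URL

    def parse(self, response):
        # The selector for the captcha image is illustrative
        captcha_src = response.css("img#captcha::attr(src)").get()
        # Yielding the item ends the spider's part of the job;
        # ImagesPipeline only downloads the image after this point.
        yield CaptchaItem(image_urls=[response.urljoin(captcha_src)])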

I've looked through the docs and examples, but I haven't found anything that makes it clear how to do this.

asked Jul 11 '11 by Kevin Burke

1 Answer

This is how you might get it to work inside the spider.

self.crawler.engine.pause()
process_my_captcha()   # show the image and block until the user has solved it
self.crawler.engine.unpause()

Once you get to the captcha request, pause the engine, display the image, read the solution from the user, and resume the crawl by submitting a POST request for the login. As a rough sketch (the selector, form field names, and the show_image_and_prompt helper are placeholders; Scrapy's cookie middleware keeps the session across these requests by default):
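
import scrapy

class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["https://example.com/login"]   # placeholder URL

    def parse(self, response):
        captcha_url = response.urljoin(
            response.css("img#captcha::attr(src)").get())
        # Fetch the captcha image within the same session and keep the
        # login page around so its form can be re-submitted later.
        yield scrapy.Request(captcha_url, callback=self.handle_captcha,
                             meta={"login_response": response})

    def handle_captcha(self, response):
        # Pause the engine while waiting for the user's input.
        self.crawler.engine.pause()
        solution = self.show_image_and_prompt(response.body)
        self.crawler.engine.unpause()

        # Submit the login form with the solved captcha;
        # field names are placeholders.
        yield scrapy.FormRequest.from_response(
            response.meta["login_response"],
            formdata={"username": "me", "password": "secret",
                      "captcha": solution},
            callback=self.after_login)

    def after_login(self, response):
        self.logger.info("Logged in, continuing from %s", response.url)

    def show_image_and_prompt(self, image_bytes):
        # Placeholder: write the image to disk and ask on the console.
        with open("captcha.jpg", "wb") as f:
            f.write(image_bytes)
        return input("Enter the captcha text shown in captcha.jpg: ")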

I'd be interested to know if the approach works for your case.

answered Oct 13 '22 by user