
Scrapy - simple captcha solving example

Searching online for ways to solve a captcha with Scrapy, I could not find a good example to start from.

I've created a very basic captcha page: http://145.100.108.148/login3/

Does anyone have a working example that solves this, or at least a Scrapy setup that makes a decent attempt at it?

Kevin C asked Jan 16 '18 11:01


1 Answer

Solving the captcha itself is easy using Pillow and Python Tesseract. The hard part was realizing how to handle the session cookie (PHPSESSID). Here's a complete working example for your case (using Python 2):

# -*- coding: utf-8 -*-
import io
import urllib2

from PIL import Image
import pytesseract
import scrapy


class CaptchaSpider(scrapy.Spider):
    name = 'captcha'

    def start_requests(self):
        # Pin a known session ID so the image request in parse() can reuse it.
        yield scrapy.Request('http://145.100.108.148/login3/',
                             cookies={'PHPSESSID': 'xyz'})

    def parse(self, response):
        # Resolve the captcha image URL relative to the page URL.
        img_url = response.urljoin(response.xpath('//img/@src').extract_first())

        # Fetch the image outside Scrapy, sending the same PHPSESSID as the
        # page request so both requests belong to one session.
        url_opener = urllib2.build_opener()
        url_opener.addheaders.append(('Cookie', 'PHPSESSID=xyz'))
        img_bytes = url_opener.open(img_url).read()
        img = Image.open(io.BytesIO(img_bytes))

        # Run OCR on the downloaded image.
        captcha = pytesseract.image_to_string(img)
        print 'Captcha solved:', captcha

        # Submit the form found in the response with the recognized text.
        return scrapy.FormRequest.from_response(
            response, formdata={'captcha': captcha},
            callback=self.after_captcha)

    def after_captcha(self, response):
        print 'Result:', response.body
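
If you are on Python 3, urllib2 no longer exists; its opener API lives in urllib.request. Below is a minimal sketch of the cookie-handling part only, not a tested port of the whole spider. The helper names build_session_opener and fetch_captcha_bytes are hypothetical, and it assumes the session cookie is still called PHPSESSID as in the example above.

```python
import urllib.request


def build_session_opener(session_id, cookie_name="PHPSESSID"):
    """Return a urllib opener that sends the session cookie on every request.

    Hypothetical helper: the cookie name defaults to the PHPSESSID used in
    the example above.
    """
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples added to each request.
    opener.addheaders.append(("Cookie", "{}={}".format(cookie_name, session_id)))
    return opener


def fetch_captcha_bytes(img_url, session_id):
    """Download the captcha image using the same session ID as the spider."""
    opener = build_session_opener(session_id)
    with opener.open(img_url) as resp:
        return resp.read()
```

In parse() you would then call fetch_captcha_bytes(img_url, 'xyz') in place of the three urllib2 lines, and replace the print statements with print() calls.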

Tomáš Linhart answered Sep 28 '22 00:09