Getting Bad Captcha Image from Google ReCaptcha Scraping

I'm trying to load captchas faster than by rendering them in a WebBrowser control and then copy/pasting the image into a PictureBox.

Why not download the picture straight into the PictureBox, which has the advantage of using less CPU and memory? This approach already works for another, more advanced captcha service called Solve Media. (With Solve Media, if you view an image URL a second time, it returns a fake error captcha image.)

But now I need to support the ReCaptcha system as well, so that my bot can run faster than refreshing a web page and waiting for it to render.

So I'll just post my code here. As far as I understand, I'm only failing to emulate one of the properties of the HTTP request. I have the User-Agent faked as a real Internet Explorer 8. I think the problem is a cookie: one seems to be generated somewhere I can't identify, although I also get one cookie, I think from downloading the JavaScript file.

Either way, Google ReCaptcha tries to trick you with a fake, unreadable captcha, as if to rub it in your face that you are not doing something right. I figured out that when you see the two black circles, the captcha is obviously fake.

Here is an example of a bad captcha and a good captcha:

[Images: a bad captcha, followed by a good captcha]

At one point I remember ReCaptcha had another security feature that somehow knew whether you loaded the captcha image from the actual domain where it's placed. I don't know how that worked, since I download everything locally, but they seem to have removed that feature anyway. (Actually, it still exists on some websites and seems to be disabled by default. It is easy to trick because it relies on the Referer header.)

I'm not trying to cheat anything here. I will still be typing in these captchas by hand, but I want to type them in faster than rendering the page normally allows.

I want the captchas to be either those street numbers, or at least two words without the black circles.

Anyway, here is my current code.

Dim newCaptcha = New Captcha
Dim myUserAgent As String = ""
Dim myReferer As String = "http://www.google.com/recaptcha/demo/"
Dim outputSite As String = HTTP.HTTPGET("http://www.google.com/recaptcha/demo/", "", "", "", myUserAgent, myReferer)
Dim recaptchaChallengeKey = GetBetween(outputSite, "http://www.google.com/recaptcha/api/challenge?k=", """")

'Google ReCaptcha Captcha
outputSite = HTTP.HTTPGET("http://www.google.com/recaptcha/api/challenge?k=" & recaptchaChallengeKey, "", "", "", myUserAgent, myReferer)

'outputSite = outputSite.Replace("var RecaptchaState = {", "{""RecaptchaState"": {")
'outputSite = outputSite.Replace("};", "}}")
'Dim jsonDictionary As Dictionary(Of String, Object) = New JavaScriptSerializer().Deserialize(Of Dictionary(Of String, Object))(outputSite)
Dim recaptchaChallenge = GetBetween(outputSite, "challenge : '", "',")
outputSite = HTTP.HTTPGET("http://www.google.com/recaptcha/api/js/recaptcha.js", "", "", "", myUserAgent, myReferer) 'This page looks useless, but the JavaScript seems to load it anyway. Maybe this is why I get bad captchas?

If HTTP.LoadWebImageToPictureBox(newCaptcha.picCaptcha, "http://www.google.com/recaptcha/api/image?c=" & recaptchaChallenge, myUserAgent, myReferer) = False Then
    MessageBox.Show("Recaptcha Image loading failed!")
Else
    Dim newWork As New Work
    newWork.CaptchaForm = newCaptcha
    newWork.AccountId = 1234 'ID of Accounts.
    newWork.CaptchaHash = "recaptcha_challenge_field=" & recaptchaChallenge
    newWork.CaptchaType = "ReCaptcha"
    Works.Add(newWork)
    newCaptcha.Show()
End If

Here is the HTTP class I use.

Imports System.Collections.Generic
Imports System.Linq
Imports System.Text
Imports System.Net
Imports System.IO
Public Class HTTP

    Public StoredCookies As New CookieContainer

    Public Function HTTPGET(ByVal url As String, ByVal proxyname As String, ByVal proxylogin As String, ByVal proxypassword As String, ByVal userAgent As String, ByVal referer As String) As String
        Dim resp As HttpWebResponse
        Dim req As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)

        If userAgent = "" Then
            userAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
        End If
        req.UserAgent = userAgent
        req.Referer = referer
        req.AllowAutoRedirect = True
        req.ReadWriteTimeout = 5000
        req.CookieContainer = StoredCookies
        req.Headers.Set("Accept-Language", "en-us")

        req.KeepAlive = True
        req.Method = "GET"

        Dim stream_in As StreamReader

        If proxyname <> "" Then
            Dim proxyIP As String = proxyname.Split(New Char() {":"})(0)
            Dim proxyPORT As Integer = CInt(proxyname.Split(New Char() {":"})(1))

            Dim proxy As New WebProxy(proxyIP, proxyPORT)
            'if proxylogin is an empty string then don't use proxy credentials (open proxy)
            If proxylogin <> "" Then
                proxy.Credentials = New NetworkCredential(proxylogin, proxypassword)
            End If
            req.Proxy = proxy
        End If

        Dim response As String = ""
        Try
            resp = DirectCast(req.GetResponse(), HttpWebResponse)
            StoredCookies.Add(resp.Cookies)
            stream_in = New StreamReader(resp.GetResponseStream())
            response = stream_in.ReadToEnd()
            stream_in.Close()
        Catch ex As Exception
        End Try
        Return response
    End Function


    Public Function LoadWebImageToPictureBox(ByVal pb As PictureBox, ByVal ImageURL As String, ByVal userAgent As String, ByVal referer As String) As Boolean
        Dim bAns As Boolean

        Try
            Dim resp As WebResponse
            Dim req As HttpWebRequest

            Dim sURL As String = Trim(ImageURL)

            If Not sURL.ToLower().StartsWith("http://") Then sURL = "http://" & sURL

            req = DirectCast(WebRequest.Create(sURL), HttpWebRequest)

            If userAgent = "" Then
                userAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
            End If
            req.UserAgent = userAgent
            req.Referer = referer
            req.AllowAutoRedirect = True
            req.ReadWriteTimeout = 5000
            req.CookieContainer = StoredCookies
            req.Headers.Set("Accept-Language", "en-us")

            req.KeepAlive = True
            req.Method = "GET"

            resp = req.GetResponse()
            If Not resp Is Nothing Then
                Dim remoteStream As Stream = resp.GetResponseStream()
                Dim objImage As New MemoryStream
                Dim bytesProcessed As Integer = 0
                Dim myBuffer As Byte()
                ReDim myBuffer(1024)
                Dim bytesRead As Integer
                bytesRead = remoteStream.Read(myBuffer, 0, 1024)
                Do While (bytesRead > 0)
                    objImage.Write(myBuffer, 0, bytesRead)
                    bytesProcessed += bytesRead
                    bytesRead = remoteStream.Read(myBuffer, 0, 1024)
                Loop
                pb.Image = Image.FromStream(objImage)
                bAns = True
                objImage.Close()
            End If
        Catch ex As Exception
            bAns = False
        End Try

        Return bAns
    End Function
End Class

EDIT: I figured out the problem. It's this obfuscated client-side JavaScript encryption system from Google at

http://www.google.com/js/th/1lOyLe_nzkTfeM2GpTkE65M1Lr8y0MC8hybXoEd-x1s.js

I still want to be able to defeat it without using a heavy web browser. Maybe some lightweight, fast JavaScript evaluation control would do? There is no point in deobfuscating it and porting it to VB.NET, because as soon as I do, they might change a few variables or the encryption entirely and all that work would be for nothing, so I want something more intelligent. At this point I don't even know how the URL is generated. It seems static for now, and it's probably a real file rather than one generated just in time.

It turns out the _challenge page, which returns the challenge for the image, only gives a decoy challenge. That challenge then gets replaced (encrypted, perhaps?) on the client side using the variables t1, t2 and t3. This encryption does not seem to be used every time: if you pass it once, you get access to do what I am trying to do. So my code pretty much works, but it stops working at seemingly random intervals. I want something more solid that I can leave unattended for weeks.

asked Jul 23 '14 08:07 by SSpoke


1 Answer

I had the same problem and found a solution. It will not give the easiest captchas, but at least the images are a lot easier to read: the result is one readable word and one obscured word.

I found that requesting "recaptcha/api/reload" is important for achieving that. Adding the "cachestop" parameter, and possibly the referer, may also make a difference.

data = UrlMgr("http://www.google.com/recaptcha/api/challenge?k=%s&cachestop=%.17f" % (id, random.random()), referer=referer, nocache=True).data
challenge = re.search("challenge : '(.*?)',", data).group(1)
server = re.search("server : '(.*?)',", data).group(1)
# This step is essential for getting readable captchas. We could take the "c" value from above and
# retrieve a captcha right away, but that one would be barely readable.
reloadParams["c"] = challenge
reloadParams["k"] = id
reloadParams["lang"] = "de"
reloadParams["reason"] = "i"
reloadParams["type"] = "image"
data = UrlMgr("http://www.google.com/recaptcha/api/reload" , params=reloadParams, referer=referer, nocache=True).data
challenge = textextract(data, "Recaptcha.finish_reload('", "',")
return challenge, solveCaptcha(UrlMgr("%simage" % (server), params={"c":challenge}, referer=referer))
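
The snippet above calls a `textextract` helper that is not shown. A minimal sketch of what such a helper might look like, assuming it simply returns the text between two markers (the same idea as the `GetBetween` call in the question); the sample string and token below are made up for illustration:

```python
def textextract(data, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns None if either marker is missing. This is an assumed
    implementation; the original helper is not shown in the answer.
    """
    start = data.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = data.find(end_marker, start)
    if end == -1:
        return None
    return data[start:end]


# Hypothetical response body in the shape the snippet above expects.
sample = "Recaptcha.finish_reload('EXAMPLE_TOKEN', 'image');"
token = textextract(sample, "Recaptcha.finish_reload('", "',")
print(token)
```

Returning None instead of raising lets the caller decide how to handle a response that does not contain the expected markers.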

As for further improvements, my guess is that the "th" parameter is used to detect bots. It is generated by some complicated JavaScript, which I have not debugged myself.

answered Oct 17 '22 08:10 by balrok