
Why does cURL give correct response but scrapy does not?

The site I want to scrape uses JavaScript to fill in a form, then POSTs it and verifies the result before serving the content.

I've replicated this JS in Python, after scraping the parameters from the JavaScript in the initial GET request. My value of "TS644333_75" matches the JS value (tested by writing it out with document.write(..) instead of letting the form submit as normal), and if you copy and paste the result into a cURL command, that works too. For example:

curl --http1.0 'http://www.betvictor.com/sports/en/football' \
  -H 'Connection: keep-alive' \
  -H 'Accept-Encoding: gzip,deflate' \
  -H 'Accept-Language: en' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
  -H 'Referer: http://www.betvictor.com/sports/en/football' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0' \
  --data 'TS644333_id=3&TS644333_75=286800b2a80cd3334cd2895e42e67031%3Ayxxy%3A3N6QfX3q%3A1685704694&TS644333_md=1&TS644333_rf=0&TS644333_ct=0&TS644333_pd=0' \
  --compressed

To get the TS644333_75 value, I simply copied and pasted the result my Python code calculated when simulating the JS.

Monitoring packets in wireshark shows this POST as shown here (I've added some line spaces to make the POST data more readable, but otherwise it's as seen in wireshark).

However if I start a scrapy shell:

1) scrapy shell "http://www.betvictor.com/sports/en/football"

and construct a form request:

2) from scrapy.http import FormRequest

   req = FormRequest(
       url='http://www.betvictor.com/sports/en/football',
       formdata={
           'TS644333_id': '3',
           'TS644333_75': '286800b2a80cd3334cd2895e42e67031:yxxy:3N6QfX3q:1685704694',
           'TS644333_md': '1',
           'TS644333_rf': '0',
           'TS644333_ct': '0',
           'TS644333_pd': '0',
       },
       headers={
           'Referer': 'http://www.betvictor.com/sports/en/football',
           'Connection': 'keep-alive',
       },
   )

Then fetch it

3) fetch(req)

The response body I get back is just another javascript challenge, not the served up content desired.

Yet the packet seen in wireshark is (again with some newlines for readability in POST params) shown here, and to my eyes looks identical.

What is going wrong? How can packets that appear identical lead to different server responses? Why is this not working with scrapy?

It could be the encoding of the ":" in the parameter computed that I POST, but it looks to have been encoded correctly, and both match in wireshark, so I can't see that as the issue.
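
The encoding concern can also be checked without Wireshark by building the form body with Python's standard library and comparing it with the --data string given to cURL. This is only a sketch: it assumes the fields are sent as an ordinary application/x-www-form-urlencoded body, as the cURL command implies, and it does not show how Scrapy itself encodes the body.

```python
from urllib.parse import urlencode

# Field values copied from the question. Dicts preserve insertion order
# (Python 3.7+), so the encoded body keeps the same field order as cURL.
form_fields = {
    'TS644333_id': '3',
    'TS644333_75': '286800b2a80cd3334cd2895e42e67031:yxxy:3N6QfX3q:1685704694',
    'TS644333_md': '1',
    'TS644333_rf': '0',
    'TS644333_ct': '0',
    'TS644333_pd': '0',
}

# urlencode percent-encodes ':' as %3A in each value.
encoded_body = urlencode(form_fields)

# The --data string from the cURL command, joined onto one line.
curl_body = (
    'TS644333_id=3&'
    'TS644333_75=286800b2a80cd3334cd2895e42e67031%3Ayxxy%3A3N6QfX3q%3A1685704694&'
    'TS644333_md=1&'
    'TS644333_rf=0&'
    'TS644333_ct=0&'
    'TS644333_pd=0'
)

print(encoded_body == curl_body)
```

If this prints True, the ":" encoding matches what cURL sends, which points away from the POST body as the cause.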

asked Mar 20 '14 11:03 by fpghost


1 Answer

It seems to work if you append a slash to your URL - so same scrapy request, but with URL changed to:

http://www.betvictor.com/sports/en/football/
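
If the URL is built in code, the slash can be appended to the path without disturbing any query string. The helper below is a hypothetical convenience (the name with_trailing_slash is not from Scrapy or any library) and uses only the standard library:

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url):
    """Return url with '/' appended to its path, leaving the query intact."""
    parts = urlsplit(url)
    # Only touch the path component; add the slash if it is not already there.
    path = parts.path if parts.path.endswith('/') else parts.path + '/'
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(with_trailing_slash('http://www.betvictor.com/sports/en/football'))
```

The returned string can then be passed as the url argument of the same FormRequest. Whether a given site needs the slash is something to test per site; this only normalizes the URL.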

Additional Example:

I had the same problem when testing another website, where the page worked nicely with curl but did not work with requests. After fighting with it for some time, the extra slash from this answer solved the problem.

import requests
import json


r = requests.get(r'https://bet.hkjc.com/marksix/getJSON.aspx/?sd=20190101&ed=20190331&sb=0')

pretty_json = json.loads(r.text)
print(json.dumps(pretty_json, indent=2))

returns this:

[
  {
    "id": "19/037",
    "date": "30/03/2019",
    "no": "15+17+18+37+39+49",
    "sno": "31",
    "sbcode": "",
...
...

The slash after .aspx is important. Without it, the page returns only a JavaScript challenge instead of the JSON.

import requests

# No slash after .aspx
r = requests.get(r'https://bet.hkjc.com/marksix/getJSON.aspx?sd=20190101&ed=20190331&sb=0')

print(r.text)

returns this:

<HTML>
<head>
<script>
Challenge=341316;
ChallengeId=49424326;
GenericErrorMessageCookies="Cookies must be enabled in order to view this page.";
</script>
<script>
function test(var1)
{
    var var_str=""+Challenge;
    var var_arr=var_str.split("");
    var LastDig=var_arr.reverse()[0];
    var minDig=var_arr.sort()[0];
    var subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);
    var subvar2 = (2 * var_arr[2])+var_arr[1];
    var my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);
    var x=(var1*3+subvar1)*1;
    var y=Math.cos(Math.PI*subvar2);
    var answer=x*y;
    answer-=my_pow*1;
    answer+=(minDig*1)-(LastDig*1);
    answer=answer+subvar2;
    return answer;
}
</script>
<script>
client = null;
if (window.XMLHttpRequest)
{
    var client=new XMLHttpRequest();
}
else
{
    if (window.ActiveXObject)
    {
        client = new ActiveXObject('MSXML2.XMLHTTP.3.0');
    };
}
if (!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))
{
    document.write("Not all needed JavaScript methods are supported.<BR>");

}
else
{
    client.onreadystatechange  = function()
    {
        if(client.readyState  == 4)
        {
            var MyCookie=client.getResponseHeader("X-AA-Cookie-Value");
            if ((MyCookie == null) || (MyCookie==""))
            {
                document.write(client.responseText);
                return;
            }

            var cookieName = MyCookie.split('=')[0];
            if (document.cookie.indexOf(cookieName)==-1)
            {
                document.write(GenericErrorMessageCookies);
                return;
            }
            window.location.reload(true);
        }
    };
    y=test(Challenge);
    client.open("POST",window.location,true);
    client.setRequestHeader('X-AA-Challenge-ID', ChallengeId);
    client.setRequestHeader('X-AA-Challenge-Result',y);
    client.setRequestHeader('X-AA-Challenge',Challenge);
    client.setRequestHeader('Content-Type' , 'text/plain');
    client.send();
}
</script>
</head>
<body>
<noscript>JavaScript must be enabled in order to view this page.</noscript>
</body>
</HTML>
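
When testing URL variants like this, it can help to detect the challenge page in code rather than by eye. The check below is a rough heuristic, not a documented contract: the marker strings are copied from the response above, and the function name looks_like_js_challenge is made up for illustration.

```python
# Marker strings copied from the challenge page above. Treating them as a
# signature of that page is an assumption and may not hold for other sites.
CHALLENGE_MARKERS = ('ChallengeId=', 'X-AA-Challenge-Result')


def looks_like_js_challenge(body):
    """Return True if the response body contains every challenge marker."""
    return all(marker in body for marker in CHALLENGE_MARKERS)


# A shortened stand-in for the challenge page, and a stand-in for real content.
challenge_sample = (
    "<script>\nChallenge=341316;\nChallengeId=49424326;\n</script>\n"
    "<script>client.setRequestHeader('X-AA-Challenge-Result',y);</script>"
)
json_sample = '[{"id": "19/037", "date": "30/03/2019"}]'

print(looks_like_js_challenge(challenge_sample))  # True
print(looks_like_js_challenge(json_sample))       # False
```

With requests, the same check could be applied to r.text before calling json.loads, so a challenge page fails with a clear message instead of a JSON decode error.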
answered Sep 30 '22 11:09 by anana