First of all, my problem is different from this one: Difference between cURL and web browser?
I use my Chrome browser to visit http://www.walmart.com/search/browse-ng.do?cat_id=1115193_1071967 and then view the page source, where I find markup like:
<a class="js-product-title" href="/ip/Tide-Simply-Clean-Fresh-Refreshing-Breeze-Liquid-Laundry-Detergent-138-fl-oz/33963161">
However, I can't find this markup in the output I get from the command line:
curl "http://www.walmart.com/search/browse-ng.do?cat_id=1115193_1071967">local.html
Does anyone know what causes the difference? I am using the Python Scrapy selector to parse the webpage.
Your browser can execute JavaScript, which can in turn change the document. Curl just gives you the original response from the server and nothing else.
If you turn off JavaScript in the browser and refresh the page, you will see that it looks different.
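One rough way to check whether the markup is really missing from what curl saved is to search the file for the anchors from the question. The snippet below is only a sketch using Python's standard html.parser; the names ProductLinkFinder and find_product_links are made up for illustration, and it assumes the links carry the js-product-title class shown in the question.

```python
from html.parser import HTMLParser


class ProductLinkFinder(HTMLParser):
    """Collects href values of <a> tags that carry the js-product-title class."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if "js-product-title" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def find_product_links(html):
    """Return the hrefs of all product-title anchors found in an HTML string."""
    finder = ProductLinkFinder()
    finder.feed(html)
    return finder.links
```

Reading local.html from the question and passing its contents to find_product_links shows what the raw response contains. An empty list there, while the browser shows the links, suggests the links are either added later by JavaScript or only sent to clients that look like a browser.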
Besides executing JS, as explained in the other answer, your browser does a lot more work to fetch that page from the server than you may realize, and the server may be reacting to that.
Paste this into notepad and take a look at what your browser sent to fetch that, vs the simple curl command you did.
curl "http://stackoverflow.com/questions/25333342/viewing-page-source-shows-different-html-than-curl" -H "Accept-Encoding: gzip,deflate,sdch" -H "Accept-Language: en-US,en;q=0.8" -H "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" -H "Referer: http://stackoverflow.com/questions?page=2&sort=newest" -H "Cookie: <cookies redacted because lulz>" -H "Connection: keep-alive" -H "Cache-Control: max-age=0" --compressed
Things like the Accept-Language header, the user agent (more or less which browser and OS you are on), and in some cases even whether the response was requested compressed can all cause a server to generate the page differently. This can be just a normal reaction (like giving browser-specific HTML to only that browser, cough*ie and opera*) or part of higher-level A/B testing on new designs or functionality. Chances are, the content you see at a URL may well be different for someone else, or even for you when using a different browser or tool.
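Since the asker is working in Python, the same idea can be tried there too. This is a minimal sketch with the standard urllib module, not a guaranteed fix: the header values are copied from the curl command above purely as an illustration, the function names are hypothetical, and whether any given site actually changes its output based on them is an assumption you would have to test.

```python
import gzip
import urllib.request

# Illustrative values copied from the curl command above; no particular
# server is known to require them.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
    "Accept-Encoding": "gzip",
}


def build_browser_like_request(url):
    """Build (but do not send) a request that carries browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def fetch(url):
    """Send the request and return the decoded body as text."""
    req = build_browser_like_request(url)
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        # We asked for gzip, so undo it if the server applied it.
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
        charset = resp.headers.get_content_charset() or "utf-8"
    return body.decode(charset, errors="replace")
```

Comparing the result of fetch(url) with what plain curl returned for the same URL gives a quick hint as to whether the headers alone explain the difference. If the two match, JavaScript or cookies are the more likely cause.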
I also have to point out that what you SEE on the page isn't what comes up with view source. The source is what was sent to your browser to render. What you actually see on the page is the result after rendering and JavaScript have run. Most browsers support some sort of "Inspect" function on the right-click menu. I suggest you look at pages through that and compare it to what shows in view source. It will change your perspective on how the web works.