Viewing "Page Source" shows different HTML than cURL

First of all, my problem is different from this one: Difference between cURL and web browser?

I use my Chrome browser to visit http://www.walmart.com/search/browse-ng.do?cat_id=1115193_1071967 and then view the page source, which contains markup like:

<a class="js-product-title" href="/ip/Tide-Simply-Clean-Fresh-Refreshing-Breeze-Liquid-Laundry-Detergent-138-fl-oz/33963161">

However, I can't find this markup when I fetch the same URL from the command line:

curl "http://www.walmart.com/search/browse-ng.do?cat_id=1115193_1071967">local.html

Does anyone know what causes the difference? I am using the Python Scrapy selector to parse the webpage.

Patrick asked Aug 15 '14 19:08


2 Answers

Your browser can execute JavaScript, which can in turn change the document. curl just gives you the original response as the server sent it, and nothing else.

If you turn off JavaScript in the browser and refresh the page, you will see that it looks different.
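As a rough illustration of that point, here is a minimal Python sketch using only the standard library. The two HTML strings are made-up stand-ins (not the real Walmart response): one is what a server might send, the other is what the document could look like after a browser has run the script. A parser that does not execute JavaScript, like curl followed by a selector, only ever sees the first.

```python
from html.parser import HTMLParser

# Hypothetical raw response: the product list is empty in the HTML itself
# and would only be filled in by the script once a browser executes it.
RAW_HTML = """
<html><body>
  <div id="products"></div>
  <script>
    document.getElementById("products").innerHTML =
      '<a class="js-product-title" href="/ip/example/1">Example</a>';
  </script>
</body></html>
"""

# Hypothetical document state after the script above has run in a browser.
RENDERED_HTML = """
<html><body>
  <div id="products"><a class="js-product-title" href="/ip/example/1">Example</a></div>
</body></html>
"""


class ProductLinkFinder(HTMLParser):
    """Collects href values of <a> tags carrying the js-product-title class."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "js-product-title" in classes:
            self.links.append(attr_map.get("href"))


def find_product_links(html):
    """Parse the markup without running any script and return matching hrefs."""
    finder = ProductLinkFinder()
    finder.feed(html)
    return finder.links


print(find_product_links(RAW_HTML))       # no links: script text is not markup
print(find_product_links(RENDERED_HTML))  # the link exists once the DOM is built
```

Whether a particular site builds its product links this way is something you would have to check for that site; the sketch only shows why the two views can differ.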

GolezTrol answered Nov 20 '22 12:11


In addition to executing JS, as explained in the other answer, your browser sends a lot more with its request for that page than you may realize, and the server may be reacting to that.

  • Open Chrome, press F12, and go to the "Network" tab.
  • Load the page you want to inspect.
  • Look for the very first request (it should have a document icon with the URL below it; you can also sort by 'Timeline' to find it).
  • Right-click the item and choose 'Copy as cURL'.

Paste this into Notepad and compare what your browser sent to fetch the page with the simple curl command you ran.

curl "http://stackoverflow.com/questions/25333342/viewing-page-source-shows-different-html-than-curl" -H "Accept-Encoding: gzip,deflate,sdch" -H "Accept-Language: en-US,en;q=0.8" -H "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" -H "Referer: http://stackoverflow.com/questions?page=2&sort=newest" -H "Cookie: <cookies redacted because lulz>" -H "Connection: keep-alive" -H "Cache-Control: max-age=0" --compressed
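If the copied command is hard to read on one line, a small Python sketch like the following can list its headers one per line. It assumes the command uses bash-style quoting (as in the example above, not the Windows cmd variant), and `headers_from_curl` is a hypothetical helper name, not part of any library. The command in the sketch is a shortened copy of the one above.

```python
import shlex

# Shortened version of a "Copy as cURL" command; paste your own here.
COPIED_COMMAND = (
    'curl "http://stackoverflow.com/questions/25333342/'
    'viewing-page-source-shows-different-html-than-curl" '
    '-H "Accept-Language: en-US,en;q=0.8" '
    '-H "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64)" '
    '-H "Connection: keep-alive" --compressed'
)


def headers_from_curl(command):
    """Return the -H/--header values of a curl command as a dict."""
    tokens = shlex.split(command)  # honours the shell-style double quotes
    headers = {}
    # Pair every token with its successor so each flag sees its argument.
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header"):
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers


for name, value in headers_from_curl(COPIED_COMMAND).items():
    print(f"{name}: {value}")
```

A plain `curl "<url>"` sends none of these unless you add them, which is the gap the comparison is meant to expose.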

Things like the language header and the user agent (more or less which browser and OS you are on), and in some cases even whether the page was requested compressed, can all cause a server to generate the page differently. This can be a normal reaction (like serving browser-specific HTML only to that browser, cough*IE and Opera*) or part of higher-level A/B testing of new designs or functionality. Chances are, the content you see at a URL may be different for someone else, or even for you when using a different browser or tool.
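To try this from Python rather than curl, a sketch along these lines builds one bare request and one with browser-like headers, using only `urllib.request` from the standard library. The header values are copied from the Chrome request shown above; `build_request` and `fetch` are hypothetical helper names, and nothing here is specific to what Walmart actually checks.

```python
import urllib.request

# Header values taken from the browser request shown above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/webp,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url, headers=None):
    """Create a GET request object; nothing is sent yet."""
    return urllib.request.Request(url, headers=headers or {})


def fetch(request, timeout=10):
    """Send the request and return the body as text (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


url = "http://www.walmart.com/search/browse-ng.do?cat_id=1115193_1071967"
plain = build_request(url)                         # like the bare curl call
browser_like = build_request(url, BROWSER_HEADERS)

# Show what each request carries; urllib adds its own default User-Agent
# to the plain one only when the request is actually sent.
print(dict(plain.header_items()))
print(dict(browser_like.header_items()))
```

Calling `fetch(plain)` and `fetch(browser_like)` and diffing the two bodies would show whether the server treats them differently; whether it does depends entirely on the site. Cookies and the `Accept-Encoding` header are left out of the sketch to keep it short.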

I also have to point out that what you SEE on the page isn't what comes up with view source. The source is what was sent to your browser to render. What you actually see on the page is the result after rendering and JavaScript have run. Most browsers support some sort of "Inspect" function in the right-click menu. I suggest you look at pages through that and compare with what view source shows; it will change your perspective on how the web works.

Uberfuzzy answered Nov 20 '22 14:11