
R shows different HTML (when compared to web browser) for the same Google Search URL

Tags: html, r, rcurl

Goal

I would like to use R to download the HTML of a Google Search webpage as shown in a web browser.

Problem

When I download the HTML of a Google Search page in R, using the exact same URL as in the web browser, the HTML R receives differs from what the browser shows. For example, for an advanced Google Search URL the date parameter is ignored in the HTML read by R, whereas in the web browser it is honoured.

Example

I do a Google Search in my web browser for "West End Theatre" and specify a date range of 1st January to 31st January 2012. I then copy the generated URL and paste it into R.

# Google Search URL from Firefox web browser
url <- "http://www.google.co.uk/search?q=west+end+theatre&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a#q=west+end+theatre&hl=en&client=firefox-a&hs=z7I&rls=org.mozilla:en-GB%3Aofficial&prmd=imvns&sa=X&ei=rJE7T8fwM82WhQe_6eD2CQ&ved=0CGoQpwUoBw&source=lnt&tbs=cdr:1%2Ccd_min%3A1%2F1%2F2012%2Ccd_max%3A31%2F1%2F2012&tbm=&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=6f92152f78004c6d&biw=1600&bih=810"
u <- URLdecode(url)

# Webpage as seen in browser
browseURL(u)

# Webpage as seen from R
HTML <- paste(readLines(u), collapse = "\n")
cat(HTML, file = "output01.html")
shell.exec("output01.html")

# Webpage as seen from R through RCurl
library(RCurl)
cookie = 'cookiefile.txt'
curl = getCurlHandle(cookiefile = cookie,
                     useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
                     header = FALSE,
                     verbose = TRUE,
                     netrc = TRUE,
                     maxredirs = as.integer(20),
                     followlocation = TRUE,
                     ssl.verifypeer = TRUE,
                     cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
HTML2 <- getURL(u, curl = curl)
cat(HTML2, file = "output02.html")
shell.exec("output02.html")

By running the self-contained code above, I can see that the first webpage which opens is what I want (with the date parameter enforced), but the second and third webpages (as downloaded through R) have the date parameter ignored.

Question

How can I download the HTML for the first webpage which opens instead of the second/third webpages?

System Information

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.6-10.1 bitops_1.0-4.1

loaded via a namespace (and not attached):
[1] tools_2.14.0
Asked Feb 15 '12 by Tony Breyal

2 Answers

Part of your problem is that Google has profiled you and is returning matches based on what it knows from your previous searches, Gmail discussions, Google Maps use, IP address, location data, ads viewed, social contacts and other services. Some of this happens even if you don't have a Google account.

Signed-in personalization: When you’re signed in to a Google Account with Web History, Google personalizes your search experience based on what you’ve searched for and which sites you’ve visited in the past.

Signed-out personalization: When you’re not signed in, Google customizes your search experience based on past search information linked to your browser, using a cookie. Google stores up to 180 days of signed-out search activity linked to your browser’s cookie, including queries and results you click.

The only way to make your automated results match your manual ones is to try to match your profile. At the very least you should send the same User-Agent string as your browser and the same cookies. You can find out what these are by sniffing your HTTP requests on the network or by using a browser addon such as Live HTTP Headers.
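As a minimal sketch of this idea: the User-Agent string and cookie value below are placeholders, not real ones -- you would copy the actual values from your own browser's request headers (via Live HTTP Headers or a network sniffer) before using them.

```r
library(RCurl)

# Placeholder values -- replace with what your own browser actually sends.
ua <- "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"
browser_cookie <- "PREF=ID=0123456789abcdef; NID=example"  # the Cookie: header value

# Attach the browser-matching headers to the RCurl handle.
curl <- getCurlHandle(useragent      = ua,
                      httpheader     = c(Cookie = browser_cookie),
                      followlocation = TRUE)

# html <- getURL(u, curl = curl)  # 'u' is the decoded URL from the question
```

Even with matching headers there is no guarantee of an identical page, since Google also varies results by IP address and other signals it infers server-side.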

As for why the date is being filtered, I think jbaums's comment covers that: some of the filtering and results-while-you-type behaviour happens client-side. There may be a way around this if you can trigger Google's old interface from before the AJAX features were added. See what you get from Google in your browser if you disable JavaScript.

Answered Nov 10 '22 by SpliFF


Instead of trying to decode the results of Google's search pages, you can use the Custom Search API. After obtaining an API key, you can specify your search criteria through the URL and receive a JSON response instead of having to parse HTML. The rjson package will help you read the JSON into an R object and extract the relevant data.

You will be limited to 1,000 queries per day, but it may be much easier to work with.
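A rough sketch of such a request, assuming you have an API key and a custom search engine ID (the key and engine ID below are placeholders); the date range restriction is expressed through the API's sort parameter rather than the web interface's tbs parameter:

```r
library(RCurl)
library(rjson)

# Placeholders -- obtain real values from the Google APIs console.
api_key <- "YOUR_API_KEY"
cx      <- "YOUR_ENGINE_ID"

# Build the request URL; date:r:YYYYMMDD:YYYYMMDD restricts by date range.
base <- "https://www.googleapis.com/customsearch/v1"
qry  <- paste0(base,
               "?key=",  api_key,
               "&cx=",   cx,
               "&q=",    curlEscape("west end theatre"),
               "&sort=", curlEscape("date:r:20120101:20120131"))

# Uncomment to run the query and pull out the result titles:
# json   <- getURL(qry)
# res    <- fromJSON(json)
# titles <- sapply(res$items, `[[`, "title")
```

This sidesteps the profiling and client-side filtering issues entirely, because the API returns the same structured data regardless of browser state.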

EDIT: Notably, the Custom Search API has been deprecated.

Answered Nov 10 '22 by nograpes