
Get Google Chrome's Inspect Element into R

Tags:

r

This question is based on another that I saw closed, which piqued my curiosity because I learned something new: Google Chrome's Inspect Element can be used to create the HTML parsing path for XML::getNodeSet. That question was closed, I think because it was too broad, so I'll ask a smaller, more focused question that may get at the root of the problem.

I tried to help the poster by writing the code I typically use for scraping, but I ran into a wall immediately: the poster wanted elements shown in Google Chrome's Inspect Element, and that is not the same as the HTML returned by htmlTreeParse, as demonstrated here:

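library(XML)  # provides htmlTreeParse
url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
m <- capture.output(doc)
# fixed = TRUE matches the literal string instead of treating "." as a regex wildcard
any(grepl("258.12", m, fixed = TRUE))
## FALSE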

But in Google Chrome's Inspect Element we can see that this information is present (highlighted in yellow):

[Screenshot of Chrome's Inspect Element pane; image not available]

How can we get the information from Google Chrome's Inspect Element into R? The poster could obviously copy and paste the code into a text editor and parse it that way, but they are looking to scrape, and that workflow does not scale. Once the poster can get this information into R, they can use typical HTML parsing techniques (XML and RCurl-fu).

Tyler Rinker asked Aug 04 '14 13:08





1 Answer

You should be able to scrape the page using something like the following RSelenium code. You need to have Java installed and available on your path for the startServer() line to work (and thus to be able to do anything).
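Before running the RSelenium code, one way to confirm the Java prerequisite from within R is to look for the executable on the PATH. This is only a minimal sketch using base R; it checks that a `java` binary can be found, not that its version is suitable for Selenium.

```r
# Look up the java executable on the PATH; Sys.which() returns "" when nothing is found
java_path <- unname(Sys.which("java"))

if (nzchar(java_path)) {
  message("Java found at: ", java_path)
} else {
  message("Java was not found on the PATH; startServer() will likely fail")
}
```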

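library("RSelenium")
# checkForServer() and startServer() come from older RSelenium releases;
# newer versions may need rsDriver() or a manually started Selenium server instead
checkForServer()   # download the Selenium standalone server if it is missing
startServer()      # launch the server (requires Java on the path)
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444,
                      browserName = "firefox")
url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
remDr$open()          # start a browser session
remDr$navigate(url)   # load the page in the browser
# getPageSource() returns a list; its first element is the page HTML as a string
source <- remDr$getPageSource()[[1]]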

Check to make sure it worked according to your test:

> grepl("258.12", source)
[1] TRUE
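
Once the page source is available as a string, it can be handed to the usual XML tools, as the question anticipates. The following is a rough sketch rather than code tested against the real page: the `rendered` string is an invented stand-in for the value returned by `getPageSource()`, and the `id` and `class` names in the XPath are hypothetical, so the path would need to be adapted to whatever Inspect Element shows for the actual page.

```r
library(XML)

# Stand-in for the string returned by remDr$getPageSource()[[1]].
# This markup is invented for illustration; it is not the real page.
rendered <- '<html><body><div id="cost"><span class="value">258.12</span></div></body></html>'

# asText = TRUE tells htmlParse to parse the string itself rather than treat it as a URL or file
doc <- htmlParse(rendered, asText = TRUE)

# XPath built from hypothetical id and class names
nodes <- getNodeSet(doc, "//div[@id='cost']/span[@class='value']")

# Extract the text content of each matching node
values <- sapply(nodes, xmlValue)
values
```

With the real page, `rendered` would be replaced by the `source` object from the RSelenium code above.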
Thomas answered Oct 11 '22 13:10
