Get Google Chrome's Inspect Element into R

Tags: r

This question grew out of another one that was closed, probably for being too broad. That question made me curious, because it taught me that Google Chrome's Inspect Element can be used to build the HTML parsing path for XML::getNodeSet. Here I'll ask a smaller, more focused question that may get at the root of the problem.

I tried to help the poster by writing the code I typically use for scraping, but ran into a wall immediately: the poster wanted elements shown in Google Chrome's Inspect Element, and that is not the same as the HTML returned by htmlTreeParse, as demonstrated here:

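library(XML)  # provides htmlTreeParse

url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
m <- capture.output(doc)   # printed HTML, one element per line
any(grepl("258.12", m))    # is the value anywhere in the parsed HTML?
## FALSE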

But here in Google Chrome's Inspect Element we can see that this information is provided (in yellow):

[Screenshot: Google Chrome's Inspect Element panel, with the value highlighted in yellow]

How can we get the information from Google Chrome's Inspect Element into R? The poster could obviously copy and paste the code into a text editor and parse it that way, but they are looking to scrape, so that workflow does not scale. Once the poster can get this information into R, they can use typical HTML parsing techniques (XML and RCurl-fu).
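One plausible explanation, not confirmed for this particular page, is that the value is written into the document by JavaScript after the page loads, so a static parse never sees it. The sketch below illustrates that idea with a made-up page: the HTML string, the `cost` id, and the script are all hypothetical and are not taken from the real site.

```r
library(XML)

# Hypothetical page: the <div> is empty in the raw HTML, and a script
# would fill it in only when a real browser executes it.
static_html <- paste0(
  '<html><body>',
  '<div id="cost"></div>',
  '<script>',
  'document.getElementById("cost").textContent = (25812 / 100).toFixed(2);',
  '</script>',
  '</body></html>'
)

# A static parser builds the tree from the raw markup and never runs the script.
doc <- htmlParse(static_html, asText = TRUE)

# Same kind of test as in the question, with a literal (non-regex) match.
found_in_static <- grepl("258.12", saveXML(doc), fixed = TRUE)
print(found_in_static)
```

If this is what happens on the real page, the value only exists in the browser's live DOM, which is what Inspect Element displays.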


asked Aug 04 '14 13:08

Tyler Rinker

1 Answer

You should be able to scrape the page with RSelenium, using something like the following code. You need Java installed and available on your path for the startServer() line to work; without it, nothing else will run.

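library("RSelenium")

# Download (if needed) and start the standalone Selenium server; requires Java on the path.
# Note: newer RSelenium releases retire checkForServer()/startServer() in favour of rsDriver().
checkForServer()
startServer()

# Connect to the local Selenium server and drive Firefox
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444,
                      browserName = "firefox")
url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
remDr$open()
remDr$navigate(url)

# HTML of the page as rendered by the browser, as a single character string
source <- remDr$getPageSource()[[1]]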

Check to make sure it worked according to your test:

> grepl("258.12", source)
[1] TRUE
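Once the rendered HTML is in R as a string, it can be handed to the usual XML tools. The following is a minimal sketch only: `page_source` is a hypothetical stand-in for the string returned by `remDr$getPageSource()[[1]]`, and the `span` element and `value` class are invented for illustration, so the XPath will need to be adapted to the real page (for example, from the path shown in Inspect Element).

```r
library(XML)

# Hypothetical stand-in for the rendered page source (not the real page's markup)
page_source <- paste0(
  '<html><body>',
  '<div class="metric"><span class="value">258.12</span></div>',
  '</body></html>'
)

# Parse the string itself (asText = TRUE) instead of fetching a URL
doc <- htmlParse(page_source, asText = TRUE)

# Select nodes by XPath and pull out their text content
nodes <- getNodeSet(doc, "//span[@class='value']")
values <- sapply(nodes, xmlValue)
print(values)
```

Passing `asText = TRUE` matters here: without it, the parser may try to treat the string as a file name or URL.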


answered Oct 11 '22 13:10

Thomas
