Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get table using rvest()

I want to grab some data from Pro Football Reference website using the rvest package. First, let's grab results for all games played in 2015 from this url http://www.pro-football-reference.com/years/2015/games.htm

library("rvest")
library("dplyr")

#grab table info
url <- "http://www.pro-football-reference.com/years/2015/games.htm"
urlHtml <- url %>% read_html() 
dat <- urlHtml %>% html_table(header=TRUE) %>% .[[1]] %>% as_data_frame()

Is that how you would have done it? :)

dat could be cleaned up a bit. Two of the variables seem to have blanks for names. Plus the header row is repeated between each week.

colnames(dat) <- c("week", "day", "date", "winner", "at", "loser", 
                   "box", "ptsW", "ptsL", "ydsW", "toW", "ydsL", "toL")

dat2 <- dat %>% filter(!(box == ""))
head(dat2)

Looks good!

Now let's look at an individual game. At the webpage above, click on "Boxscore" in the very first row of the table: The Sept 10th game played between New England and Pittsburgh. That takes us here: http://www.pro-football-reference.com/boxscores/201509100nwe.htm.

I want to grab the individual snap counts for each player (about half way down the page). Pretty sure these will be our first two lines of code:

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()

But now I can't figure out how to grab the specific table I want. I use the Selector Gadget to highlight the table of Patriots snap counts. I do this by clicking on the table in several places, then 'unclicking' the other tables that were highlighted. I end up with a path of:

#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left

Each of these attempts returns {xml_nodeset (0)}

gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left, #home_snap_counts .left, #home_snap_counts .tooltip, #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right , #home_snap_counts .left")
gameHtml %>% html_nodes("#home_snap_counts .right")
gameHtml %>% html_nodes("#home_snap_counts")

Maybe let's try using xpath. All of these attempts also return {xml_nodeset (0)}

gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))] | //*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "tooltip", " " ))]//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ), concat( " ", "left", " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]//*[contains(concat( " ", @class, " " ))]')
gameHtml %>% html_nodes(xpath = '//*[(@id = "home_snap_counts")]')

How can I grab that table? I'll also point out, when I do "View Page Source" in Google Chrome, the tables I want almost seem to be commented out? That is, they're typed in green, instead of the usual red/black/blue color scheme. That is not the case for the table of game results we pulled first. "View Page Source" for that table is the usual red/black/blue color scheme. Is the greenness indicative of what's preventing me from being able to grab this snap count table?

Thanks!

like image 859
hossibley Avatar asked Aug 30 '16 16:08

hossibley


1 Answers

The information you are looking for is programmatically display at run time. One solution is to use RSelenium.

While looking at the web page's source, the information from the tables are stored in the code but are hidden because the tables are stored as comments. Here is my solution where I remove the comments markers and reprocess the page normally.

I saved the file to the working directory and then read the file in using the readLines function.
Now I search for the html’s begin and end comment flags and then remove them. I save the file a second time (less the comment flags) in order to reread and process the file for the selected tables.

gameUrl <- "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
gameHtml <- gameUrl %>% read_html()
gameHtml %>% html_nodes("tbody")

#Only save and work with the body
body<-html_node(gameHtml,"body")
write_xml(body, "nfl.xml")

#Find and remove comments
lines<-readLines("nfl.xml")
lines<-lines[-grep("<!--", lines)]
lines<-lines[-grep("-->", lines)]
writeLines(lines, "nfl2.xml")

#Read the file back in and process normally
body<-read_html("nfl2.xml")
html_table(html_nodes(body, "table")[29])

#extract the attributes and find the attribute of interest
a<-html_attrs(html_nodes(body, "table"))

#find the tables of interest.
homesnap<-which(sapply(a, function(x){x[2]})=="home_snap_counts")
html_table(html_nodes(body, "table")[homesnap])

visitsnap<-which(sapply(a, function(x){x[2]})=="vis_snap_counts")
html_table(html_nodes(body, "table")[visitsnap])
like image 117
Dave2e Avatar answered Oct 22 '22 12:10

Dave2e