I am trying to build a data frame by scraping a table from the web, but the table is split across multiple pages. The link is the same; only the page number changes.
For the first page, this is how I would scrape it:
library(XML)
CB.13<- "http://www.nfl.com/stats/categorystats?tabSeq=1&season=2013&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p=1&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"
CB.13<- readHTMLTable(CB.13, header=FALSE)
cornerback.function <- function(CB.13){
  first <- "1"
  last <- "1"
  tab <- NA
  for (i in 1:length(CB.13)){
    lastrow <- nrow(CB.13[[i]])
    lastcol <- ncol(CB.13[[i]])
    if (as.numeric(CB.13[[i]][1, 1]) == first & as.numeric(CB.13[[i]][lastrow, lastcol]) == last) {
      tab <- i
    }
  }
  tab  ## return the index of the stats table
}
tab <- cornerback.function(CB.13)
cornerbacks.2013 <- CB.13[[tab]]
cb.names<- c("Rk", "name", "Team", "Pos", "Comb", "Total", "Ast", "Sck", "SFTY", "PDef", "Int", "TDs", "Yds", "Lng", "FF", "Rec", "TD")
names(cornerbacks.2013)<- cb.names
I need to do this for multiple years, all with multiple pages, so is there a quicker way to get all of the pages instead of scraping each page individually and merging them? The next link would be http://www.nfl.com/stats/categorystats?tabSeq=1&season=2013&seasonType=REG&Submit=Go&experience=&archive=false&conference=null&d-447263-p=2&statisticPositionCategory=DEFENSIVE_BACK&qualified=true
and there are 8 pages for this year. Maybe a for loop to loop through the pages?
You can dynamically build the URL with paste0, since the links differ only slightly: for a given year, only the page number changes. You get a URL structure like:
url <- paste0(url1, year, url2, page, url3) ## change page, year, or both
You can create a function that loops over the pages and returns a table, then bind the results with the classic do.call(rbind, ...):
library(XML)
url1 <- "http://www.nfl.com/stats/categorystats?tabSeq=1&season="
year <- 2013
url2 <- "&seasonType=REG&experience=&Submit=Go&archive=false&conference=null&d-447263-p="
page <- 1
url3 <- "&statisticPositionCategory=DEFENSIVE_BACK&qualified=true"
getTable <- function(page = 1, year = 2013){
  url <- paste0(url1, year, url2, page, url3)
  tab <- readHTMLTable(url, header = FALSE)
  tab$result  ## the stats table comes back as the list element named "result"
}
## this merges all 8 pages into a single big table
do.call(rbind,lapply(seq_len(8),getTable,year=2013))
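Since getTable() already takes a year argument, the same idea extends to several seasons with an outer loop. A minimal sketch, assuming each season also spans 8 pages (the page count per season is an assumption; adjust it, or use the next-page approach below):
years <- 2011:2013  ## hypothetical range of seasons
all_years <- do.call(rbind, lapply(years, function(y) {
  tab <- do.call(rbind, lapply(seq_len(8), getTable, year = y))  ## assumes 8 pages per season
  names(tab) <- cb.names
  tab$season <- y  ## keep track of which season each row came from
  tab
}))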
The more general method is to scrape the "next page" URL from the page itself with an XPath expression and loop until there is no next page. This can be more difficult to do, but it is the cleanest solution:
getNext <- function(url = url_base){
  doc <- htmlParse(url)
  ## the pager ("next" link) sits in an element with class "linkNavigation floatRight"
  XPATH_NEXT <- "//*[@class='linkNavigation floatRight']/*[contains(., 'next')]"
  next_page <- unique(xpathSApply(doc, XPATH_NEXT, xmlGetAttr, 'href'))
  if (length(next_page) > 0)
    paste0("http://www.nfl.com", next_page)  ## the href is relative, so prepend the host
  else ''
}
## url_base is your first page's url
res <- NULL
while (TRUE){
  tab <- readHTMLTable(url_base, header = FALSE)
  res <- rbind(res, tab$result)
  url_base <- getNext(url_base)
  if (nchar(url_base) == 0)  ## getNext returns '' when there is no next page
    break
}
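To reuse this per season, you could wrap the loop in a small helper that starts from the page-1 URL built with paste0 above. A minimal sketch (scrapeAllPages is just an illustrative name):
scrapeAllPages <- function(url_base){
  res <- NULL
  while (nchar(url_base) > 0){
    tab <- readHTMLTable(url_base, header = FALSE)
    res <- rbind(res, tab$result)
    url_base <- getNext(url_base)  ## '' once there is no next page
  }
  res
}
cornerbacks.2013 <- scrapeAllPages(paste0(url1, 2013, url2, 1, url3))
names(cornerbacks.2013) <- cb.names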