Using R to navigate and scrape a webpage with drop-down HTML forms

I'm attempting to scrape data from http://www.footballoutsiders.com/stats/snapcounts, but I can't change the fields in the drop-down boxes on the site ("team", "week", "position", and "year"). My attempt to scrape the table for team = "ALL", week = "1", pos = "ALL", and year = "2015" with rvest is below.

library(rvest)

url <- "http://www.footballoutsiders.com/stats/snapcounts"
pgsession <- html_session(url)

# The third form on the page holds the drop-down fields
pgform <- html_form(pgsession)[[3]]
filled_form <- set_values(pgform,
                          "team" = "ALL",
                          "week" = "1",
                          "pos"  = "ALL",
                          "year" = "2015")

submit_form(session = pgsession, form = filled_form, POST = url)

y <- read_html("http://www.footballoutsiders.com/stats/snapcounts")

y <- y %>%
    html_nodes("table") %>%
    .[[2]] %>%
    html_table(header = TRUE)

This code returns the table associated with the default drop-down values (team = "ALL", week = "20", pos = "QB", and year = "2015"), a data frame with only 11 observations. If the fields had actually been changed, it would have returned a data frame with 1,695 observations.

John asked Oct 18 '22 06:10

1 Answer

You can capture the session returned by submit_form and pipe that session into html_nodes:

# submit_form() returns a new session containing the POST response
d <- submit_form(session = pgsession, form = filled_form)

y <- d %>%
    html_nodes("table") %>%
    .[[2]] %>%
    html_table(header = TRUE)

dim(y)
#[1] 1695   11

Otherwise, if you call read_html(url), you are re-reading the original page rather than the response to the form submission, so you get the table for the default drop-down values.
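As a side note, html_session(), set_values(), and submit_form() were deprecated in rvest 1.0 in favor of session(), html_form_set(), and session_submit(). A sketch of the same flow in the newer API, assuming the form index and field names on the page are unchanged, would be:

```r
library(rvest)

sess <- session("http://www.footballoutsiders.com/stats/snapcounts")

# Assumption: the drop-down form is still the third form on the page
form <- html_form(sess)[[3]]
filled <- html_form_set(form,
                        team = "ALL",
                        week = "1",
                        pos  = "ALL",
                        year = "2015")

# session_submit() returns the session holding the POST response
resp <- session_submit(sess, filled)

y <- resp %>%
    html_elements("table") %>%
    .[[2]] %>%
    html_table(header = TRUE)
```

The key point is the same in both APIs: keep the object returned by the submit call and parse the table out of it, rather than fetching the URL again.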

Jota answered Oct 28 '22 21:10