
change dropdown menu then scrape data using rvest or httr

Tags:

r

I'm scraping data from https://rotogrinders.com/lineups/nfl?site=draftkings. Currently, I use myData <- read_html("https://rotogrinders.com/lineups/nfl?site=draftkings") to bring in data, then extract the data I want using html_nodes. I'm trying to change the slate selection menu, then grab the data. The XPath for the menu I'm trying to change is //select[@name='slate_name'].

My research leads me to believe I need to implement one of the following functions, but I'm unsure how to go about doing it, as the menu is not in a form and there is no submit button... the page automatically reloads once a new option is selected:

httr::POST
rvest::html_session
RSelenium

I'm not familiar with the RSelenium package, so ideally I'm looking for a solution using httr or rvest.

orangeman51 asked Sep 15 '18 22:09




1 Answer

You already have all the information via read_html(). The slate-name drop-down just filters the schedules via JavaScript, so no new request is needed. I would suggest grabbing all the data and filtering it yourself. Hope that helps.

library(magrittr)
library(rvest)
#> Loading required package: xml2

url <- "https://rotogrinders.com/lineups/nfl?site=draftkings"
myData <- read_html(url) 

myData %>%
  html_nodes(".teams") %>%
  html_text() %>%
  stringr::str_squish()
#>  [1] "New York NYJ Jets Cleveland CLE Browns"          
#>  [2] "New Orleans NOS Saints Atlanta ATL Falcons"      
#>  [3] "Buffalo BUF Bills Minnesota MIN Vikings"         
#>  [4] "Denver DEN Broncos Baltimore BAL Ravens"         
#>  [5] "Indianapolis IND Colts Philadelphia PHI Eagles"  
#>  [6] "Cincinnati CIN Bengals Carolina CAR Panthers"    
#>  [7] "San Francisco SFO 49ers Kansas City KCC Chiefs"  
#>  [8] "Green Bay GBP Packers Washington WAS Redskins"   
#>  [9] "Oakland OAK Raiders Miami MIA Dolphins"          
#> [10] "New York NYG Giants Houston HOU Texans"          
#> [11] "Tennessee TEN Titans Jacksonville JAC Jaguars"   
#> [12] "Los Angeles LAC Chargers Los Angeles LAR Rams"   
#> [13] "Chicago CHI Bears Arizona ARI Cardinals"         
#> [14] "Dallas DAL Cowboys Seattle SEA Seahawks"         
#> [15] "New England NEP Patriots Detroit DET Lions"      
#> [16] "Pittsburgh PIT Steelers Tampa Bay TBB Buccaneers"

Created on 2018-09-22 by the reprex package (v0.2.1)
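
As a rough illustration of "filter it yourself", the sketch below keeps only the matchups that involve the teams of one slate. It uses three of the strings from the output above as a hard-coded stand-in for the scraped vector. The `slate_teams` vector is an assumption made for illustration; on the real page the slate-to-game mapping would have to come from the slate data embedded in the HTML.

```r
# Three matchup strings copied from the output above (stand-in for the scraped vector)
matchups <- c(
  "Chicago CHI Bears Arizona ARI Cardinals",
  "Dallas DAL Cowboys Seattle SEA Seahawks",
  "New England NEP Patriots Detroit DET Lions"
)

# Hypothetical slate: the team abbreviations we want to keep
slate_teams <- c("DAL", "SEA")

# Build a pattern that matches any of the abbreviations as a whole word
pattern <- paste0("\\b(", paste(slate_teams, collapse = "|"), ")\\b")

# Keep only the matchups that mention one of the slate's teams
in_slate <- matchups[grepl(pattern, matchups, perl = TRUE)]
print(in_slate)
```

Matching on the abbreviation as a whole word avoids accidental hits inside longer team or city names.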

EDIT: You still get all the relevant information via read_html(). You need to get the IDs from the drop-down and then parse the JavaScript string that contains all the salaries. I did the first part; the rest is up to you ;-)

library(tidyverse, quietly = TRUE)
library(rvest, warn.conflicts = FALSE)
#> Loading required package: xml2

url <- "https://rotogrinders.com/lineups/nfl?site=draftkings"
raw <- read_html(url) 

# helper function
parse_json <- function(x) tibble(name = x$name, importID = x$importId)

# get id from slates
raw %>%
  html_nodes(".slate-data") %>%
  html_attr(name = "value") %>%
  jsonlite::fromJSON() %>%
  purrr::map_df(parse_json)
#> # A tibble: 10 x 2
#>    name                                                importID
#>    <chr>                                               <chr>   
#>  1 1:00pm: Classic: 13 Games                           21505   
#>  2 8:20pm: Classic (Thu-Mon): 16 Games                 21576   
#>  3 1:00pm: Classic (Sun-Mon): 15 Games                 21586   
#>  4 1:00pm: Tiers (NFL Tiers): 14 Games                 21589   
#>  5 1:00pm: Classic (Early Only): 10 Games              21581   
#>  6 4:05pm: Classic (Afternoon Only): 3 Games           21630   
#>  7 4:25pm: Classic (Afternoon Turbo): 2 Games          21631   
#>  8 8:20pm: Classic (Primetime): 2 Games                21645   
#>  9 4:25pm: Showdown Captain Mode (DAL vs SEA): 1 Games 21632   
#> 10 8:20pm: Showdown Captain Mode (NE vs DET): 1 Games  21644

raw %>%
  html_nodes(".select") %>%
  html_nodes("script") %>%
  html_text() %>%
  stringr::str_squish() %>%
  substr(1, 1000)
#> [1] "window.slateSelect = window.createReactComponent(SlateSelectRadnor, { slates: {\"All Games\":{\"games\":[{\"scheduleId\":\"45755\",\"teamAwayId\":\"12\",\"teamHomeId\":\"3\"},{\"scheduleId\":\"45756\",\"teamAwayId\":\"23\",\"teamHomeId\":\"21\"},{\"scheduleId\":\"45757\",\"teamAwayId\":\"9\",\"teamHomeId\":\"8\"},{\"scheduleId\":\"45758\",\"teamAwayId\":\"25\",\"teamHomeId\":\"1\"},{\"scheduleId\":\"45759\",\"teamAwayId\":\"14\",\"teamHomeId\":\"19\"},{\"scheduleId\":\"45760\",\"teamAwayId\":\"2\",\"teamHomeId\":\"22\"},{\"scheduleId\":\"45761\",\"teamAwayId\":\"31\",\"teamHomeId\":\"26\"},{\"scheduleId\":\"45762\",\"teamAwayId\":\"7\",\"teamHomeId\":\"20\"},{\"scheduleId\":\"45763\",\"teamAwayId\":\"27\",\"teamHomeId\":\"10\"},{\"scheduleId\":\"45764\",\"teamAwayId\":\"18\",\"teamHomeId\":\"13\"},{\"scheduleId\":\"45765\",\"teamAwayId\":\"16\",\"teamHomeId\":\"15\"},{\"scheduleId\":\"45766\",\"teamAwayId\":\"28\",\"teamHomeId\":\"30\"},{\"scheduleId\":\"45767\",\"teamAwayId\":\"5\",\"teamHomeId\":\"29\"},{\"scheduleId\":\"45768\",\"teamAwayId\":\"17\",\"teamHomeId\":\"32\"},{\"scheduleId\":\"45769\",\"teamAwayId\":\"11\",\"teamHomeId\":\"6\"},{\"scheduleId\":\"45770\","

Created on 2018-09-23 by the reprex package (v0.2.1)
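
One possible way to approach the remaining step, sketched under assumptions: cut the JSON object that follows `slates:` out of the script text and hand it to jsonlite. The `script_text` below is a shortened, hand-made stand-in that mirrors the beginning of the output above; the real string is much longer and its full layout is not shown here. The helper `extract_json_after` is hypothetical, not part of any package.

```r
library(jsonlite)

# Shortened stand-in for the script text returned by html_text() (assumption:
# the real string has the same "slates: { ... }" layout, only much longer)
script_text <- 'window.slateSelect = window.createReactComponent(SlateSelectRadnor, { slates: {"All Games":{"games":[{"scheduleId":"45755","teamAwayId":"12","teamHomeId":"3"},{"scheduleId":"45756","teamAwayId":"23","teamHomeId":"21"}]}} });'

# Hypothetical helper: return the balanced {...} block that starts at the
# first "{" after `key`. The brace counter ignores braces inside JSON strings,
# so it is only a rough sketch.
extract_json_after <- function(text, key) {
  key_pos <- as.integer(regexpr(key, text, fixed = TRUE))
  if (key_pos < 0) stop("key not found")
  chars <- strsplit(substring(text, key_pos + nchar(key)), "")[[1]]
  start <- match("{", chars)
  if (is.na(start)) stop("no opening brace after key")
  # Nesting depth after each character: +1 for "{", -1 for "}"
  depth <- cumsum((chars == "{") - (chars == "}"))
  end <- which(depth == 0 & seq_along(chars) >= start)[1]
  if (is.na(end)) stop("unbalanced braces")
  paste(chars[start:end], collapse = "")
}

# Parse the extracted JSON; the "games" array becomes a data frame
slates <- fromJSON(extract_json_after(script_text, "slates:"))
games <- slates[["All Games"]]$games
print(games)
```

With the real page, the same idea would be applied to the full string from html_text(), and each slate name would give its own list of games to match against the scraped matchups.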

Birger answered Oct 21 '22 07:10