Data Scraping in R

Tags:

r

I'd like to use the statistics from the Premier League's website for a class project. This is the website: https://www.premierleague.com/stats/top/players/goals

There are filters that allow us to filter by season and other factors, and a button at the bottom of the page that allows us to view the next 20 entries in the table.

My code is as follows:

library(tidyverse)
library(rvest)
url <- "https://www.premierleague.com/stats/top/players/goals?se=79"
url %>%
  read_html() %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

which outputs:

   Rank                  Player            Club         Nationality Stat
1     1            Alan Shearer               -             England  260
2     2            Wayne Rooney         Everton             England  208
3     3             Andrew Cole               -             England  187    
4     4           Frank Lampard               -             England  177
5     5           Thierry Henry               -              France  175
6     6           Robbie Fowler               -             England  163
7     7           Jermain Defoe AFC Bournemouth             England  162
8     8            Michael Owen               -             England  150
9     9           Les Ferdinand               -             England  149
10   10        Teddy Sheringham               -             England  146
11   11        Robin van Persie               -         Netherlands  144
12   12           Sergio Agüero Manchester City           Argentina  143
13   13 Jimmy Floyd Hasselbaink               -         Netherlands  127
14   14            Robbie Keane               -             Ireland  126
15   15          Nicolas Anelka               -              France  125
16   16            Dwight Yorke               - Trinidad And Tobago  123
17   17          Steven Gerrard               -             England  120
18   18              Ian Wright               -             England  113
19   19             Dion Dublin               -             England  111
20   20            Emile Heskey               -             England  110

However, when changing filters on the site (for instance, in my use case, restricting the table to the current season), and using the arrows to access the next 20 entries in the table, the URL does not change.

I have found the relevant areas of the source code. They are:

<div data-script="pl_stats" data-widget="stats-table" data-current-size="20"
     data-stat="" data-type="player" data-page-size="20" data-page="0"
     data-comps="1" data-num-entries="2162">

<div class="dropDown noLabel topStatsFilterDropdown" data-listener="true">
    <div data-metric="mins_played" class="current currentStatContainer" 
aria-expanded="false">Minutes played</div>
    <ul class="dropdownList" role="listbox">

I would like to be able to modify the data-metric and data-page fields.

asked May 12 '18 20:05 by CFlaherty



2 Answers

This solution requires access to a running Selenium server.

library(RSelenium) # if it is not available from CRAN: devtools::install_github("ropensci/RSelenium")
library(rvest)
library(xml2) # provides xml_find_first(), used in el_exists() below

# helper functions ---------------------------

# click_el() solves the problem mentioned here:
# https://stackoverflow.com/questions/11908249/debugging-element-is-not-clickable-at-point-error
click_el <- function(rem_dr, el) {
  rem_dr$executeScript("arguments[0].click();", args = list(el))
}

# wrapper around findElement()
find_el <- function(rem_dr, xpath) {
  rem_dr$findElement("xpath", xpath)
}

# check if an element exists in the DOM
el_exists <- function(rem_dr, xpath) {
  maybe_el <- read_html(rem_dr$getPageSource()[[1]]) %>%
    xml_find_first(xpath = xpath)
  length(maybe_el) != 0
}

# try to click on an element if it exists
click_if_exists <- function(rem_dr, xpath) {
  if (el_exists(rem_dr, xpath)) {
    suppressMessages({
      try({
        el <- find_el(rem_dr, xpath)
        el$clickElement()
      }, silent = TRUE
      )
    })
  }
}

# close google ads so they don't get in the way of clicking other elements
maybe_close_ads <- function(rem_dr) {
  click_if_exists(rem_dr, '//a[@id="advertClose" and @class="closeBtn"]')
}

# click on button that requires we accept cookies
maybe_accept_cookies <- function(rem_dr) {
  click_if_exists(rem_dr, "//div[@class='btn-primary cookies-notice-accept']")
}

# parse the data table you're interested in
get_tbl <- function(rem_dr) {
  read_html(rem_dr$getPageSource()[[1]]) %>% 
    html_nodes("table") %>% 
    .[[1]] %>% 
    html_table()
}

# actual execution ---------------------------

# first you need to start a selenium server. i'm running the server inside a
# docker container and having it listen on port 4445 on my local machine
# (see http://rpubs.com/johndharrison/RSelenium-Basics for more details).
# run this in a shell, not in R:
#   docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.1

# connect to selenium server from within r
rem_dr <- remoteDriver(
  remoteServerAddr = "localhost", port = 4445L, browserName = "firefox"
)
rem_dr$open()

# go to webpage
rem_dr$navigate("https://www.premierleague.com/stats/top/players/goals")

# close ads
maybe_close_ads(rem_dr)
Sys.sleep(3)

# the seasons to iterate over
start <- 1992:2017 # you may want to replace this with `start <- 1992:1995` when testing
seasons <- paste0(start, "/", substr(start + 1, 3, 4))

# list to hold each season's data
out_list <- vector("list", length(seasons))
names(out_list) <- seasons

for (season in seasons) {

  maybe_close_ads(rem_dr)

  # to filter the data by season, we first need to click on the "filter by season" drop down
  # menu, so that the divs representing the various seasons become active (otherwise, 
  # we can't click them)
  cur_season <- find_el(
    rem_dr, '//div[@class="current" and @data-dropdown-current="FOOTBALL_COMPSEASON" and @role="button"]'
  )
  click_el(rem_dr, cur_season)
  Sys.sleep(3)

  # now we can select the season of interest
  xpath <- sprintf(
    '//ul[@data-dropdown-list="FOOTBALL_COMPSEASON"]/li[@data-option-name="%s"]', 
    season
  )
  season_lnk <- find_el(rem_dr, xpath)
  click_el(rem_dr, season_lnk)
  Sys.sleep(3)

  # parse the table shown on the first page
  tbl <- get_tbl(rem_dr)

  # iterate over all additional pages 
  nxt_page_act <- '//div[@class="paginationBtn paginationNextContainer"]'
  nxt_page_inact <- '//div[@class="paginationBtn paginationNextContainer inactive"]'

  while (!el_exists(rem_dr, nxt_page_inact)) {

    maybe_close_ads(rem_dr)
    maybe_accept_cookies(rem_dr)

    rem_dr$maxWindowSize()
    btn <- find_el(rem_dr, nxt_page_act)
    click_el(rem_dr, btn) # click "next button"

    maybe_accept_cookies(rem_dr)
    new_tbl <- get_tbl(rem_dr)
    tbl <- rbind(tbl, new_tbl)
    cat(".")
    Sys.sleep(2)
  }

  # put this season's data into the output list
  out_list[[season]] <- tbl
  print(season)
}

This takes a while to run. I ended up with 6,731 rows of data in total (across all seasons) when I ran it.
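Since `out_list` is a named list with one table per season, the per-season tables can be stacked into a single data frame. The following is only a minimal sketch in base R: the toy `out_list` and its values are made up for illustration, and `add_season()` is a hypothetical helper, not part of the code above.

```r
# Toy stand-in for the `out_list` built by the loop above (made-up values;
# the real list holds one scraped table per season).
out_list <- list(
  "1992/93" = data.frame(Rank = 1:2, Player = c("Player A", "Player B"),
                         Stat = c(22L, 20L), stringsAsFactors = FALSE),
  "1993/94" = data.frame(Rank = 1:2, Player = c("Player C", "Player A"),
                         Stat = c(34L, 25L), stringsAsFactors = FALSE)
)

# Hypothetical helper: prepend the season label as a column of the table.
add_season <- function(tbl, season) {
  data.frame(season = season, tbl, stringsAsFactors = FALSE, check.names = FALSE)
}

# Label every table with its season, then stack all tables row-wise.
labelled <- unname(Map(add_season, out_list, names(out_list)))
all_seasons <- do.call(rbind, labelled)
rownames(all_seasons) <- NULL

print(all_seasons)
```

If dplyr is loaded, `bind_rows(out_list, .id = "season")` should give a similar result, provided the column types agree across seasons.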

answered Sep 26 '22 05:09 by Chris


Another way is to grab the resource directly.

In your browser, open Developer Tools (F12 in Chrome/Chromium), head to "Network", refresh (F5), and look for what looks like a nicely formatted JSON. When we've found it, we copy the link address and the headers (right-click on the resource > Copy link address, Copy request header) as well to impersonate the browser.


Here we need two requests: the season ids first, then the player data for each season:

library(httr)
library(purrr)
library(dplyr)

h <- add_headers(
  "Host" = "footballapi.pulselive.com",
  "Connection" = "keep-alive",
  "Origin" = "https://www.premierleague.com",
  "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36",
  "Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
  "Accept" = "*/*",
  "DNT" = "1",
  "Referer" = "https://www.premierleague.com/stats/top/players/goals",
  "Accept-Encoding" = "gzip, deflate, br",
  "Accept-Language" = "fr,en-US;q=0.9,en;q=0.8"
)


r <- GET("https://footballapi.pulselive.com/football/competitions/1/compseasons?page=0&pageSize=100", h)
seasons <- content(r, type = "application/json")$content %>% transpose() %>% map(unlist) %>% {setNames(.$id, .$label)}

res <- seasons %>% 
  map(~ {
    url <- paste0("https://footballapi.pulselive.com/football/stats/ranked/players/goals",
                  "?page=0", "&pageSize=", "250", "&compSeasons=", .x, "&comps=1&compCodeForActivePlayer=EN_PR&altIds=true")
    # Be gentle
    Sys.sleep(0.5 + runif(1))
    r <- GET(url, h)
    jsonlite::flatten(jsonlite::fromJSON(content(r, as = "text"))$stats$content)
  }) %>%
  bind_rows(.id = "season")

glimpse(res)

# Observations: 6,460
# Variables: 35
# $ season                        <chr> "2017/18", "2017/18", "2017/18", "2017/18", "2017/18", "2017/18", "2017/18"...
# $ rank                          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 14, 14, 16, 16, 16, 16, 16, 16, ...
# $ name                          <chr> "goals", "goals", "goals", "goals", "goals", "goals", "goals", "goals", "go...
# $ value                         <dbl> 32, 30, 21, 20, 18, 16, 15, 14, 13, 12, 12, 12, 12, 11, 11, 10, 10, 10, 10,...
# $ description                   <chr> "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", ...
# $ owner.playerId                <dbl> 47184, 6566, 6009, 11259, 31712, 49685, 19511, 48542, 31068, 6288, 11264, 4...
# $ owner.active                  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU...
# $ owner.age                     <chr> "25 years 336 days", "24 years 293 days", "29 years 349 days", "31 years 12...
# $ owner.id                      <dbl> 5178, 3960, 4328, 8979, 4316, 4290, 13511, 6899, 19680, 4503, 8983, 4772, 4...
# $ owner.info.position           <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "M", "M", "F", "M", "F", "F", ...
# $ owner.info.shirtNum           <dbl> 11, 32, 10, 9, NA, 9, 9, 9, 33, 10, 26, 17, 7, 7, 9, 14, 23, 19, 4, 10, 19,...
# $ owner.info.positionInfo       <chr> "Left/Centre/Right Second Striker", "Centre Striker", "Centre Striker", "Ce...
# $ owner.info.loan               <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
# $ owner.nationalTeam.isoCode    <chr> "EG", "GB-ENG", "AR", "GB-ENG", "GB-ENG", "BE", "BR", "FR", "BR", "BE", "DZ...
# $ owner.nationalTeam.country    <chr> "Egypt", "England", "Argentina", "England", "England", "Belgium", "Brazil",...
# $ owner.nationalTeam.demonym    <chr> "Egyptian", "English", NA, "English", "English", "Belgian", "Brazilian", "F...
# $ owner.currentTeam.name        <chr> "Liverpool", "Tottenham Hotspur", "Manchester City", "Leicester City", "Man...
# $ owner.currentTeam.teamType    <chr> "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FI...
# $ owner.currentTeam.shortName   <chr> "Liverpool", "Spurs", "Man City", "Leicester", "Man City", "Man Utd", "Live...
# $ owner.currentTeam.id          <dbl> 10, 21, 11, 26, 11, 12, 10, 1, 11, 4, 26, 131, 21, 25, 4, 1, 21, 10, 6, 7, ...
# $ owner.currentTeam.club.name   <chr> "Liverpool", "Tottenham Hotspur", "Manchester City", "Leicester City", "Man...
# $ owner.currentTeam.club.abbr   <chr> "LIV", "TOT", "MCI", "LEI", "MCI", "MUN", "LIV", "ARS", "MCI", "CHE", "LEI"...
# $ owner.currentTeam.club.id     <dbl> 10, 21, 11, 26, 11, 12, 10, 1, 11, 4, 26, 131, 21, 25, 4, 1, 21, 10, 6, 7, ...
# $ owner.currentTeam.altIds.opta <chr> "t14", "t6", "t43", "t13", "t43", "t1", "t14", "t3", "t43", "t8", "t13", "t...
# $ owner.birth.place             <chr> "Basyoun", NA, "Buenos Aires", NA, "Kingston", "Antwerp", "Maceio", "Lyon",...
# $ owner.birth.date.millis       <dbl> 708566400000, 743817600000, 581212800000, 537321600000, 786844800000, 73725...
# $ owner.birth.date.label        <chr> "15 June 1992", "28 July 1993", "2 June 1988", "11 January 1987", "8 Decemb...
# $ owner.birth.country.isoCode   <chr> "EG", "GB-ENG", "AR", "GB-ENG", "JM", "BE", "BR", "FR", "BR", "BE", "FR", "...
# $ owner.birth.country.country   <chr> "Egypt", "England", "Argentina", "England", "Jamaica", "Belgium", "Brazil",...
# $ owner.birth.country.demonym   <chr> "Egyptian", "English", NA, "English", "Jamaican", "Belgian", "Brazilian", "...
# $ owner.name.display            <chr> "Mohamed Salah", "Harry Kane", "Sergio Agüero", "Jamie Vardy", "Raheem Ster...
# $ owner.name.first              <chr> "Mohamed", "Harry", "Sergio", "Jamie", "Raheem", "Romelu", "Roberto Firmino...
# $ owner.name.last               <chr> "Salah Ghaly", "Kane", "Agüero", "Vardy", "Sterling", "Lukaku", "Barbosa de...
# $ owner.name.middle             <chr> NA, NA, NA, NA, "Shaquille", "Menama", NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
# $ owner.altIds.opta             <chr> "p118748", "p78830", "p37572", "p101668", "p103955", "p66749", "p92217", "p...
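The question asked how to change the `data-metric` and `data-page` fields. In the request above, these appear to correspond to the stat name in the URL path and the `page` query parameter. The sketch below only builds such URLs; `stats_url()` and its argument names are hypothetical, and whether the API accepts a metric name such as `mins_played` in the path is an assumption I have not verified.

```r
# Hypothetical helper: build a ranked-stats URL for a given stat, page and season.
# The query parameter names are copied from the request shown above.
stats_url <- function(stat = "goals", page = 0, page_size = 20, comp_season = NULL) {
  query <- c(
    page = as.integer(page),
    pageSize = as.integer(page_size),
    compSeasons = comp_season, # dropped from the query when NULL (all seasons)
    comps = 1,
    compCodeForActivePlayer = "EN_PR",
    altIds = "true"
  )
  paste0(
    "https://footballapi.pulselive.com/football/stats/ranked/players/", stat,
    "?", paste0(names(query), "=", query, collapse = "&")
  )
}

# Same URL as in the loop above, here for a single season id
cat(stats_url("goals", page = 0, page_size = 250, comp_season = 79), "\n")

# Second page of 20 entries for another metric (assumed to be a valid stat name)
cat(stats_url("mins_played", page = 1), "\n")
```

Each URL can then be passed to `GET(url, h)` as before, incrementing `page` until the returned content is empty.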

Edit
It seems you want the "All seasons" data. This is actually a little simpler, since the season parameter can be dropped:

url <- paste0("https://footballapi.pulselive.com/football/stats/ranked/players/goals?",
              "page=0&pageSize=", 1000, "&comps=1&compCodeForActivePlayer=EN_PR&altIds=true")
r <- GET(url, h)
dat <- jsonlite::flatten(jsonlite::fromJSON(content(r, as = "text"))$stats$content) 

# Observations: 1,000
# Variables: 34
# $ rank                          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, ...
# $ name                          <chr> "goals", "goals", "goals", "goals", "goals", "goals", "goals", "goals", "go...
# $ value                         <dbl> 260, 208, 187, 177, 175, 163, 162, 150, 149, 146, 144, 143, 127, 126, 125, ...
# $ description                   <chr> "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", ...
# $ owner.playerId                <dbl> 1440, 49682, 4848, 14004, 25936, 55008, 47174, 6476, 3754, 110022, 93894, 6...
# $ owner.active                  <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, ...
# $ owner.age                     <chr> "47 years 277 days", "32 years 205 days", "46 years 214 days", "39 years 33...
# $ owner.id                      <dbl> 89, 2064, 725, 800, 1659, 277, 1526, 1208, 462, 576, 2616, 4328, 1413, 1692...
# $ owner.info.position           <chr> "F", "F", "F", "M", NA, "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "...
# $ owner.info.shirtNum           <dbl> 9, 10, 20, 18, NA, 27, 18, 10, NA, NA, 32, 10, 18, 20, 39, 19, 8, NA, NA, 1...
# $ owner.info.positionInfo       <chr> "Forward", "Centre Second Striker", "Forward", "Centre Central Midfielder",...
# $ owner.info.loan               <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
# $ owner.nationalTeam.isoCode    <chr> "GB-ENG", "GB-ENG", "GB-ENG", "GB-ENG", "FR", "GB-ENG", "GB-ENG", "GB-ENG",...
# $ owner.nationalTeam.country    <chr> "England", "England", "England", "England", "France", "England", "England",...
# $ owner.nationalTeam.demonym    <chr> "English", "English", "English", "English", "French", "English", "English",...
# $ owner.birth.place             <chr> NA, "Liverpool", NA, "Romford", "Les Ulis", NA, "Beckton", NA, NA, NA, "Rot...
# $ owner.birth.date.millis       <dbl> 19353600000, 498960000000, 56332800000, 267148800000, 240624000000, 1662336...
# $ owner.birth.date.label        <chr> "13 August 1970", "24 October 1985", "15 October 1971", "20 June 1978", "17...
# $ owner.birth.country.isoCode   <chr> "GB-ENG", "GB-ENG", "GB-ENG", "GB-ENG", "FR", "GB-ENG", "GB-ENG", "GB-ENG",...
# $ owner.birth.country.country   <chr> "England", "England", "England", "England", "France", "England", "England",...
# $ owner.birth.country.demonym   <chr> "English", "English", "English", "English", "French", "English", "English",...
# $ owner.name.display            <chr> "Alan Shearer", "Wayne Rooney", "Andrew Cole", "Frank Lampard", "Thierry He...
# $ owner.name.first              <chr> "Alan", "Wayne", "Andrew", "Frank", "Thierry", "Robbie", "Jermain", "Michae...
# $ owner.name.last               <chr> "Shearer", "Rooney", "Cole", "Lampard", "Henry", "Fowler", "Defoe", "Owen",...
# $ owner.name.middle             <chr> NA, "Mark", NA, "James", NA, NA, "Colin", NA, NA, NA, NA, NA, NA, NA, NA, N...
# $ owner.altIds.opta             <chr> "p1", "p13017", "p1820", "p2051", "p1619", "p1794", "p7958", "p1795", "p190...
# $ owner.currentTeam.name        <chr> NA, "Everton", NA, "Manchester City", "New York Red Bulls", NA, "AFC Bourne...
# $ owner.currentTeam.teamType    <chr> NA, "FIRST", NA, "FIRST", "FIRST", NA, "FIRST", NA, NA, NA, "FIRST", "FIRST...
# $ owner.currentTeam.shortName   <chr> NA, "Everton", NA, "Man City", "New York Red Bulls", NA, "AFC Bournemouth",...
# $ owner.currentTeam.id          <dbl> NA, 7, NA, 11, 599, NA, 127, NA, NA, NA, 236, 11, NA, NA, NA, NA, NA, NA, N...
# $ owner.currentTeam.club.name   <chr> NA, "Everton", NA, "Manchester City", "New York Red Bulls", NA, "Bournemout...
# $ owner.currentTeam.club.abbr   <chr> NA, "EVE", NA, "MCI", "NY", NA, "BOU", NA, NA, NA, "FEY", "MCI", NA, NA, NA...
# $ owner.currentTeam.club.id     <dbl> NA, 7, NA, 11, 479, NA, 127, NA, NA, NA, 236, 11, NA, NA, NA, NA, NA, NA, N...
# $ owner.currentTeam.altIds.opta <chr> NA, "t11", NA, "t43", "t399", NA, "t91", NA, NA, NA, "t198", "t43", NA, NA,...

I haven't looked at the data in detail, but at first sight it does not seem to be the same as simply aggregating the per-season data:

res %>% 
  group_by(owner.playerId, owner.name.first, owner.name.last) %>% 
  summarise(total_goals = sum(value, na.rm = TRUE)) %>% 
  arrange(desc(total_goals))

# # A tibble: 433 x 4
# # Groups:   owner.playerId, owner.name.first [433]
#    owner.playerId owner.name.first owner.name.last     total_goals
#             <dbl> <chr>            <chr>                     <dbl>
#  1           6566 Harry            Kane                         84
#  2           6009 Sergio           Agüero                       65
#  3          49685 Romelu           Lukaku                       59
#  4          11259 Jamie            Vardy                        57
#  5          93889 Alexis           Sánchez                      46
#  6          14848 Bamidele         Alli                         37
#  7          19511 Roberto Firmino  Barbosa de Oliveira          36
#  8          11264 Riyad            Mahrez                       35
#  9          94221 Olivier          Giroud                       35
# 10          30549 Sadio            Mané                         34
# # ... with 423 more rows
answered Sep 26 '22 05:09 by Aurèle