I'd like to use the statistics from the Premier League's website for a class project. This is the website: https://www.premierleague.com/stats/top/players/goals
There are filters that allow us to filter by season and other factors, and a button at the bottom of the page that allows us to view the next 20 entries in the table.
My code is as follows:
library(tidyverse)
library(rvest)
url <- "https://www.premierleague.com/stats/top/players/goals?se=79"
url %>%
read_html() %>%
html_nodes("table") %>%
.[[1]] %>%
html_table()
which outputs:
Rank Player Club Nationality Stat
1 1 Alan Shearer - England 260
2 2 Wayne Rooney Everton England 208
3 3 Andrew Cole - England 187
4 4 Frank Lampard - England 177
5 5 Thierry Henry - France 175
6 6 Robbie Fowler - England 163
7 7 Jermain Defoe AFC Bournemouth England 162
8 8 Michael Owen - England 150
9 9 Les Ferdinand - England 149
10 10 Teddy Sheringham - England 146
11 11 Robin van Persie - Netherlands 144
12 12 Sergio Agüero Manchester City Argentina 143
13 13 Jimmy Floyd Hasselbaink - Netherlands 127
14 14 Robbie Keane - Ireland 126
15 15 Nicolas Anelka - France 125
16 16 Dwight Yorke - Trinidad And Tobago 123
17 17 Steven Gerrard - England 120
18 18 Ian Wright - England 113
19 19 Dion Dublin - England 111
20 20 Emile Heskey - England 110
However, when changing filters on the site (for instance, in my use case, restricting the table to the current season), and using the arrows to access the next 20 entries in the table, the URL does not change.
I have found the relevant areas of the source code. They are:
<div data-script="pl_stats" data-widget="stats-table" data-current-size="20"
data-stat="" data-type="player" data-page-size="20" data-page="0" data-
comps="1" data-num-entries="2162">
<div class="dropDown noLabel topStatsFilterDropdown" data-listener="true">
<div data-metric="mins_played" class="current currentStatContainer"
aria-expanded="false">Minutes played</div>
<ul class="dropdownList" role="listbox">
I would like to be able to modify the data-metric and data-page fields.
The commonly used web Scraping tools for R is rvest. Install the package rvest in your R Studio using the following code. Having, knowledge of HTML and CSS will be an added advantage. It's observed that most of the Data Scientists are not very familiar with technical knowledge of HTML and CSS.
statsmodels in Python and other packages provide decent coverage for statistical methods, but the R ecosystem is far larger. It's usually more straightforward to do non-statistical tasks in Python. With well-maintained libraries like BeautifulSoup and requests, web scraping in Python is more straightforward than in R.
Data scraping, in its most general form, refers to a technique in which a computer program extracts data from output generated from another program. Data scraping is commonly manifest in web scraping, the process of using an application to extract valuable information from a website.
This solution requires that you have access to a selenium server.
library(RSelenium) # not on cran (install with devtools::install_github("ropensci/RSelenium"))
library(rvest)
# helper functions ---------------------------
# click_el() solves the problem mentioned here:
# https://stackoverflow.com/questions/11908249/debugging-element-is-not-clickable-at-point-error
click_el <- function(rem_dr, el) {
rem_dr$executeScript("arguments[0].click();", args = list(el))
}
# wrapper around findElement()
find_el <- function(rem_dr, xpath) {
rem_dr$findElement("xpath", xpath)
}
# check if an element exists on the dom
el_exists <- function(rem_dr, xpath) {
maybe_el <- read_html(rem_dr$getPageSource()[[1]]) %>%
xml_find_first(xpath = xpath)
length(maybe_el) != 0
}
# try to click on a element if it exists
click_if_exists <- function(rem_dr, xpath) {
if (el_exists(rem_dr, xpath)) {
suppressMessages({
try({
el <- find_el(rem_dr, xpath)
el$clickElement()
}, silent = TRUE
)
})
}
}
# close google adds so they don't get in the way of clicking other elements
maybe_close_ads <- function(rem_dr) {
click_if_exists(rem_dr, '//a[@id="advertClose" and @class="closeBtn"]')
}
# click on button that requires we accept cookies
maybe_accept_cookies <- function(rem_dr) {
click_if_exists(rem_dr, "//div[@class='btn-primary cookies-notice-accept']")
}
# parse the data table you're interested in
get_tbl <- function(rem_dr) {
read_html(rem_dr$getPageSource()[[1]]) %>%
html_nodes("table") %>%
.[[1]] %>%
html_table()
}
# actual execution ---------------------------
# first u need to start selenium server...i'm running the server inside a
# docker container and having it listen on port 4445 on my local machine
# (see http://rpubs.com/johndharrison/RSelenium-Basics for more details):
`docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.1`
# connect to selenium server from within r
rem_dr <- remoteDriver(
remoteServerAddr = "localhost", port = 4445L, browserName = "firefox"
)
rem_dr$open()
# go to webpage
rem_dr$navigate("https://www.premierleague.com/stats/top/players/goals")
# close adds
maybe_close_ads(rem_dr)
Sys.sleep(3)
# the seasons to iterate over
start <- 1992:2017 # u may want to replace this with `start <- 1992:1995` when testing
seasons <- paste0(start, "/", substr(start + 1, 3, 4))
# list to hold each season's data
out_list <- vector("list", length(seasons))
names(out_list) <- seasons
for (season in seasons) {
maybe_close_ads(rem_dr)
# to filter the data by season, we first need to click on the "filter by season" drop down
# menu, so that the divs representing the various seasons become active (otherwise,
# we can't click them)
cur_season <- find_el(
rem_dr, '//div[@class="current" and @data-dropdown-current="FOOTBALL_COMPSEASON" and @role="button"]'
)
click_el(rem_dr, cur_season)
Sys.sleep(3)
# now we can select the season of interest
xpath <- sprintf(
'//ul[@data-dropdown-list="FOOTBALL_COMPSEASON"]/li[@data-option-name="%s"]',
season
)
season_lnk <- find_el(rem_dr, xpath)
click_el(rem_dr, season_lnk)
Sys.sleep(3)
# parse the table shown on the first page
tbl <- get_tbl(rem_dr)
# iterate over all additional pages
nxt_page_act <- '//div[@class="paginationBtn paginationNextContainer"]'
nxt_page_inact <- '//div[@class="paginationBtn paginationNextContainer inactive"]'
while (!el_exists(rem_dr, nxt_page_inact)) {
maybe_close_ads(rem_dr)
maybe_accept_cookies(rem_dr)
rem_dr$maxWindowSize()
btn <- find_el(rem_dr, nxt_page_act)
click_el(rem_dr, btn) # click "next button"
maybe_accept_cookies(rem_dr)
new_tbl <- get_tbl(rem_dr)
tbl <- rbind(tbl, new_tbl)
cat(".")
Sys.sleep(2)
}
# put this season's data into the output list
out_list[[season]] <- tbl
print(season)
}
This takes a bit of time to run. I ended up with 6,731 rows of data total (across all seasons) when i ran it.
Another way is to grab the resource directly.
In your browser, open Developer Tools (F12 in Chrome/Chromium), head to "Network", refresh (F5), and look for what looks like a nicely formatted JSON. When we've found it, we copy the link address and the headers (right-click on the resource > Copy link address, Copy request header) as well to impersonate the browser.
Here we need 2: season ids first, then player data:
library(httr)
library(purrr)
library(dplyr)
h <- add_headers(
"Host" = "footballapi.pulselive.com",
"Connection" = "keep-alive",
"Origin" = "https://www.premierleague.com",
"User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36",
"Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
"Accept" = "*/*",
"DNT" = "1",
"Referer" = "https://www.premierleague.com/stats/top/players/goals",
"Accept-Encoding" = "gzip, deflate, br",
"Accept-Language" = "fr,en-US;q=0.9,en;q=0.8"
)
r <- GET("https://footballapi.pulselive.com/football/competitions/1/compseasons?page=0&pageSize=100", h)
seasons <- content(r, type = "application/json")$content %>% transpose() %>% map(unlist) %>% {setNames(.$id, .$label)}
res <- seasons %>%
map(~ {
url <- paste0("https://footballapi.pulselive.com/football/stats/ranked/players/goals",
"?page=0", "&pageSize=", "250", "&compSeasons=", .x, "&comps=1&compCodeForActivePlayer=EN_PR&altIds=true")
# Be gentle
Sys.sleep(0.5 + runif(1))
r <- GET(url, h)
jsonlite::flatten(jsonlite::fromJSON(content(r, as = "text"))$stats$content)
}) %>%
bind_rows(.id = "season")
glimpse(res)
# Observations: 6,460
# Variables: 35
# $ season <chr> "2017/18", "2017/18", "2017/18", "2017/18", "2017/18", "2017/18", "2017/18"...
# $ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 14, 14, 16, 16, 16, 16, 16, 16, ...
# $ name <chr> "goals", "goals", "goals", "goals", "goals", "goals", "goals", "goals", "go...
# $ value <dbl> 32, 30, 21, 20, 18, 16, 15, 14, 13, 12, 12, 12, 12, 11, 11, 10, 10, 10, 10,...
# $ description <chr> "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", ...
# $ owner.playerId <dbl> 47184, 6566, 6009, 11259, 31712, 49685, 19511, 48542, 31068, 6288, 11264, 4...
# $ owner.active <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU...
# $ owner.age <chr> "25 years 336 days", "24 years 293 days", "29 years 349 days", "31 years 12...
# $ owner.id <dbl> 5178, 3960, 4328, 8979, 4316, 4290, 13511, 6899, 19680, 4503, 8983, 4772, 4...
# $ owner.info.position <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "M", "M", "F", "M", "F", "F", ...
# $ owner.info.shirtNum <dbl> 11, 32, 10, 9, NA, 9, 9, 9, 33, 10, 26, 17, 7, 7, 9, 14, 23, 19, 4, 10, 19,...
# $ owner.info.positionInfo <chr> "Left/Centre/Right Second Striker", "Centre Striker", "Centre Striker", "Ce...
# $ owner.info.loan <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
# $ owner.nationalTeam.isoCode <chr> "EG", "GB-ENG", "AR", "GB-ENG", "GB-ENG", "BE", "BR", "FR", "BR", "BE", "DZ...
# $ owner.nationalTeam.country <chr> "Egypt", "England", "Argentina", "England", "England", "Belgium", "Brazil",...
# $ owner.nationalTeam.demonym <chr> "Egyptian", "English", NA, "English", "English", "Belgian", "Brazilian", "F...
# $ owner.currentTeam.name <chr> "Liverpool", "Tottenham Hotspur", "Manchester City", "Leicester City", "Man...
# $ owner.currentTeam.teamType <chr> "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FI...
# $ owner.currentTeam.shortName <chr> "Liverpool", "Spurs", "Man City", "Leicester", "Man City", "Man Utd", "Live...
# $ owner.currentTeam.id <dbl> 10, 21, 11, 26, 11, 12, 10, 1, 11, 4, 26, 131, 21, 25, 4, 1, 21, 10, 6, 7, ...
# $ owner.currentTeam.club.name <chr> "Liverpool", "Tottenham Hotspur", "Manchester City", "Leicester City", "Man...
# $ owner.currentTeam.club.abbr <chr> "LIV", "TOT", "MCI", "LEI", "MCI", "MUN", "LIV", "ARS", "MCI", "CHE", "LEI"...
# $ owner.currentTeam.club.id <dbl> 10, 21, 11, 26, 11, 12, 10, 1, 11, 4, 26, 131, 21, 25, 4, 1, 21, 10, 6, 7, ...
# $ owner.currentTeam.altIds.opta <chr> "t14", "t6", "t43", "t13", "t43", "t1", "t14", "t3", "t43", "t8", "t13", "t...
# $ owner.birth.place <chr> "Basyoun", NA, "Buenos Aires", NA, "Kingston", "Antwerp", "Maceio", "Lyon",...
# $ owner.birth.date.millis <dbl> 708566400000, 743817600000, 581212800000, 537321600000, 786844800000, 73725...
# $ owner.birth.date.label <chr> "15 June 1992", "28 July 1993", "2 June 1988", "11 January 1987", "8 Decemb...
# $ owner.birth.country.isoCode <chr> "EG", "GB-ENG", "AR", "GB-ENG", "JM", "BE", "BR", "FR", "BR", "BE", "FR", "...
# $ owner.birth.country.country <chr> "Egypt", "England", "Argentina", "England", "Jamaica", "Belgium", "Brazil",...
# $ owner.birth.country.demonym <chr> "Egyptian", "English", NA, "English", "Jamaican", "Belgian", "Brazilian", "...
# $ owner.name.display <chr> "Mohamed Salah", "Harry Kane", "Sergio Agüero", "Jamie Vardy", "Raheem Ster...
# $ owner.name.first <chr> "Mohamed", "Harry", "Sergio", "Jamie", "Raheem", "Romelu", "Roberto Firmino...
# $ owner.name.last <chr> "Salah Ghaly", "Kane", "Agüero", "Vardy", "Sterling", "Lukaku", "Barbosa de...
# $ owner.name.middle <chr> NA, NA, NA, NA, "Shaquille", "Menama", NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
# $ owner.altIds.opta <chr> "p118748", "p78830", "p37572", "p101668", "p103955", "p66749", "p92217", "p...
Edit
It seems you want the "All seasons" data, this is actually a little simpler:
url <- paste0("https://footballapi.pulselive.com/football/stats/ranked/players/goals?",
"page=0&pageSize=", 1000, "&comps=1&compCodeForActivePlayer=EN_PR&altIds=true")
r <- GET(url, h)
dat <- jsonlite::flatten(jsonlite::fromJSON(content(r, as = "text"))$stats$content)
# Observations: 1,000
# Variables: 34
# $ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, ...
# $ name <chr> "goals", "goals", "goals", "goals", "goals", "goals", "goals", "goals", "go...
# $ value <dbl> 260, 208, 187, 177, 175, 163, 162, 150, 149, 146, 144, 143, 127, 126, 125, ...
# $ description <chr> "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", ...
# $ owner.playerId <dbl> 1440, 49682, 4848, 14004, 25936, 55008, 47174, 6476, 3754, 110022, 93894, 6...
# $ owner.active <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, ...
# $ owner.age <chr> "47 years 277 days", "32 years 205 days", "46 years 214 days", "39 years 33...
# $ owner.id <dbl> 89, 2064, 725, 800, 1659, 277, 1526, 1208, 462, 576, 2616, 4328, 1413, 1692...
# $ owner.info.position <chr> "F", "F", "F", "M", NA, "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "...
# $ owner.info.shirtNum <dbl> 9, 10, 20, 18, NA, 27, 18, 10, NA, NA, 32, 10, 18, 20, 39, 19, 8, NA, NA, 1...
# $ owner.info.positionInfo <chr> "Forward", "Centre Second Striker", "Forward", "Centre Central Midfielder",...
# $ owner.info.loan <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
# $ owner.nationalTeam.isoCode <chr> "GB-ENG", "GB-ENG", "GB-ENG", "GB-ENG", "FR", "GB-ENG", "GB-ENG", "GB-ENG",...
# $ owner.nationalTeam.country <chr> "England", "England", "England", "England", "France", "England", "England",...
# $ owner.nationalTeam.demonym <chr> "English", "English", "English", "English", "French", "English", "English",...
# $ owner.birth.place <chr> NA, "Liverpool", NA, "Romford", "Les Ulis", NA, "Beckton", NA, NA, NA, "Rot...
# $ owner.birth.date.millis <dbl> 19353600000, 498960000000, 56332800000, 267148800000, 240624000000, 1662336...
# $ owner.birth.date.label <chr> "13 August 1970", "24 October 1985", "15 October 1971", "20 June 1978", "17...
# $ owner.birth.country.isoCode <chr> "GB-ENG", "GB-ENG", "GB-ENG", "GB-ENG", "FR", "GB-ENG", "GB-ENG", "GB-ENG",...
# $ owner.birth.country.country <chr> "England", "England", "England", "England", "France", "England", "England",...
# $ owner.birth.country.demonym <chr> "English", "English", "English", "English", "French", "English", "English",...
# $ owner.name.display <chr> "Alan Shearer", "Wayne Rooney", "Andrew Cole", "Frank Lampard", "Thierry He...
# $ owner.name.first <chr> "Alan", "Wayne", "Andrew", "Frank", "Thierry", "Robbie", "Jermain", "Michae...
# $ owner.name.last <chr> "Shearer", "Rooney", "Cole", "Lampard", "Henry", "Fowler", "Defoe", "Owen",...
# $ owner.name.middle <chr> NA, "Mark", NA, "James", NA, NA, "Colin", NA, NA, NA, NA, NA, NA, NA, NA, N...
# $ owner.altIds.opta <chr> "p1", "p13017", "p1820", "p2051", "p1619", "p1794", "p7958", "p1795", "p190...
# $ owner.currentTeam.name <chr> NA, "Everton", NA, "Manchester City", "New York Red Bulls", NA, "AFC Bourne...
# $ owner.currentTeam.teamType <chr> NA, "FIRST", NA, "FIRST", "FIRST", NA, "FIRST", NA, NA, NA, "FIRST", "FIRST...
# $ owner.currentTeam.shortName <chr> NA, "Everton", NA, "Man City", "New York Red Bulls", NA, "AFC Bournemouth",...
# $ owner.currentTeam.id <dbl> NA, 7, NA, 11, 599, NA, 127, NA, NA, NA, 236, 11, NA, NA, NA, NA, NA, NA, N...
# $ owner.currentTeam.club.name <chr> NA, "Everton", NA, "Manchester City", "New York Red Bulls", NA, "Bournemout...
# $ owner.currentTeam.club.abbr <chr> NA, "EVE", NA, "MCI", "NY", NA, "BOU", NA, NA, NA, "FEY", "MCI", NA, NA, NA...
# $ owner.currentTeam.club.id <dbl> NA, 7, NA, 11, 479, NA, 127, NA, NA, NA, 236, 11, NA, NA, NA, NA, NA, NA, N...
# $ owner.currentTeam.altIds.opta <chr> NA, "t11", NA, "t43", "t399", NA, "t91", NA, NA, NA, "t198", "t43", NA, NA,...
(I haven't looked at the data in detail, but at first sight, it seems this is not the same as simply aggregating:
res %>%
group_by(owner.playerId, owner.name.first, owner.name.last) %>%
summarise(total_goals = sum(value, na.rm = TRUE)) %>%
arrange(desc(total_goals))
# # A tibble: 433 x 4
# # Groups: owner.playerId, owner.name.first [433]
# owner.playerId owner.name.first owner.name.last total_goals
# <dbl> <chr> <chr> <dbl>
# 1 6566 Harry Kane 84
# 2 6009 Sergio Agüero 65
# 3 49685 Romelu Lukaku 59
# 4 11259 Jamie Vardy 57
# 5 93889 Alexis Sánchez 46
# 6 14848 Bamidele Alli 37
# 7 19511 Roberto Firmino Barbosa de Oliveira 36
# 8 11264 Riyad Mahrez 35
# 9 94221 Olivier Giroud 35
# 10 30549 Sadio Mané 34
# # ... with 423 more rows
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With