I am trying to scrape the data, using R, from this site: http://www.soccer24.com/kosovo/superliga/results/#
I can do the following:
library(rvest)
doc <- html("http://www.soccer24.com/kosovo/superliga/results/")
but am stumped on how to actually get to the data. This is because the actual data on the website seems to be generated by JavaScript. What I can do is
html_text(doc)
but that gives a long blurb of weird text (which does include the data, but interspersed with odd code), and it's not at all clear how I would parse that.
What I want to extract is the match data (date, time, teams, result) for all of the matches. No other data is needed from this site.
Can anyone provide some hints as to how to extract that data from this site?
Using RSelenium with phantomjs
library(RSelenium)
library(XML)   # needed later for htmlParse and readHTMLTable

# start a phantomjs instance and connect a remote driver to it
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
appURL <- "http://www.soccer24.com/kosovo/superliga/results/#"
remDr$open()
remDr$navigate(appURL)
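As an optional sanity check (the exact title text is just whatever the site happens to serve), you can confirm that the page actually loaded and give the JavaScript a moment to render before going further:

remDr$getTitle()[[1]]   # likely mentions the Kosovo Superliga results page
Sys.sleep(5)            # allow the scripted content time to render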
If you want to click the "more data" button repeatedly until it is no longer visible (at which point all matches are presumed to be showing):
# find the "more results" link and keep clicking it until it is hidden
webElem <- remDr$findElement("css", "#tournament-page-results-more a")
while (webElem$isElementDisplayed()[[1]]) {
  webElem$clickElement()
  Sys.sleep(5)   # allow the extra matches to load before clicking again
  webElem <- remDr$findElement("css", "#tournament-page-results-more a")
}

# grab the fully rendered page source and parse it with the XML package
doc <- htmlParse(remDr$getPageSource()[[1]])
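If the button is removed from the DOM entirely once everything is shown (rather than just hidden), findElement() will throw an error. A slightly more defensive sketch of the same loop, using the same selector plus an arbitrary cap on the number of clicks, would be:

clicks <- 0
repeat {
  webElem <- tryCatch(remDr$findElement("css", "#tournament-page-results-more a"),
                      error = function(e) NULL)
  if (is.null(webElem) || !webElem$isElementDisplayed()[[1]] || clicks >= 20) break
  webElem$clickElement()
  Sys.sleep(5)
  clicks <- clicks + 1   # the cap of 20 clicks is arbitrary; adjust as needed
}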
Remove the unwanted round data and use XML::readHTMLTable for simplicity:
# remove unwanted round rows from the html. Sometimes there are end-of-season
# extra games; these are presented in a separate table.
invisible(doc["//table/*/tr[@class='event_round']", fun = removeNodes])

# read all tables except the last one and combine them into a single data frame
appData <- readHTMLTable(doc, which = seq(length(doc["//table"]) - 1),
                         stringsAsFactors = FALSE, trim = TRUE)
if (!is.data.frame(appData)) { appData <- do.call(rbind, appData) }
row.names(appData) <- NULL
names(appData) <- c("blank", "Date", "hteam", "ateam", "score")

pJS$stop()   # shut phantomjs down when finished
> head(appData)
blank Date hteam ateam score
1 01.04. 18:00 Ferronikeli Ferizaj 4 : 0
2 01.04. 18:00 Istogu Hajvalia 2 : 1
3 01.04. 18:00 Kosova Vushtrri Trepca Mitrovice 1 : 0
4 01.04. 18:00 Prishtina Drenica 3 : 0
5 31.03. 18:00 Besa Peje Drita 1 : 0
6 31.03. 18:00 Trepca 89 Vellaznimi 2 : 0
> tail(appData)
blank Date hteam ateam score
115 17.08. 22:00 Besa Peje Trepca 89 3 : 3
116 17.08. 22:00 Ferronikeli Hajvalia 2 : 5
117 17.08. 22:00 Trepca Mitrovice Ferizaj 1 : 0
118 17.08. 22:00 Vellaznimi Drenica 2 : 1
119 16.08. 22:00 Kosova Vushtrri Drita 0 : 1
120 16.08. 22:00 Prishtina Istogu 2 : 1
Carry out further formatting as needed.
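For example, here is a minimal sketch of that post-processing, assuming the column names assigned above (note the site only gives day and month, so attaching a season year is left to you):

appData$blank <- NULL   # drop the empty first column

# split "01.04. 18:00" into separate date and time columns
dt <- do.call(rbind, strsplit(appData$Date, " ", fixed = TRUE))
appData$date <- dt[, 1]
appData$time <- dt[, 2]

# split "4 : 0" into numeric home and away goals
sc <- do.call(rbind, strsplit(appData$score, " : ", fixed = TRUE))
appData$hgoals <- as.integer(sc[, 1])
appData$agoals <- as.integer(sc[, 2])

head(appData)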