I am trying to scrape the data, using R, from this site: http://www.soccer24.com/kosovo/superliga/results/#
I can do the following:
library(rvest)
doc <- html("http://www.soccer24.com/kosovo/superliga/results/")
but am stumped on how to actually get to the data. This is because the actual data on the website seems to be generated by JavaScript. What I can do is
html_text(doc)
but that gives a long blurb of weird text (which does include the data, but interspersed with odd code), and it's not at all clear how I would parse that.
What I want to extract is the match data (date, time, teams, result) for all of the matches. No other data is needed from this site.
Can anyone provide some hints as to how to extract that data from this site?
Using RSelenium with phantomjs
library(RSelenium)
library(XML)   # needed later for htmlParse and readHTMLTable

# start a phantomjs instance and connect a remote driver to it
pJS <- phantom()
remDr <- remoteDriver(browserName = "phantomjs")
appURL <- "http://www.soccer24.com/kosovo/superliga/results/#"
remDr$open()
remDr$navigate(appURL)
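As an optional sanity check (the exact title text is just whatever the site happens to serve), you can confirm that the page actually loaded and give the JavaScript a moment to render before going further:

remDr$getTitle()[[1]]   # likely mentions the Kosovo Superliga results page
Sys.sleep(5)            # allow the scripted content time to render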
If you want to click the "more data" button repeatedly until it is no longer visible (at which point all matches are presumed to be showing):
# find the "more results" link and keep clicking it until it is hidden
webElem <- remDr$findElement("css", "#tournament-page-results-more a")
while (webElem$isElementDisplayed()[[1]]) {
  webElem$clickElement()
  Sys.sleep(5)   # allow the extra matches to load before clicking again
  webElem <- remDr$findElement("css", "#tournament-page-results-more a")
}

# grab the fully rendered page source and parse it with the XML package
doc <- htmlParse(remDr$getPageSource()[[1]])
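If the button is removed from the DOM entirely once everything is shown (rather than just hidden), findElement() will throw an error. A slightly more defensive sketch of the same loop, using the same selector plus an arbitrary cap on the number of clicks, would be:

clicks <- 0
repeat {
  webElem <- tryCatch(remDr$findElement("css", "#tournament-page-results-more a"),
                      error = function(e) NULL)
  if (is.null(webElem) || !webElem$isElementDisplayed()[[1]] || clicks >= 20) break
  webElem$clickElement()
  Sys.sleep(5)
  clicks <- clicks + 1   # the cap of 20 clicks is arbitrary; adjust as needed
}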
Remove the unwanted round data and use XML::readHTMLTable for simplicity:
# remove unwanted round rows from the html. Sometimes there are end-of-season
# extra games; these are presented in a separate table.
invisible(doc["//table/*/tr[@class='event_round']", fun = removeNodes])

# read all tables except the last one and combine them into a single data frame
appData <- readHTMLTable(doc, which = seq(length(doc["//table"]) - 1),
                         stringsAsFactors = FALSE, trim = TRUE)
if (!is.data.frame(appData)) { appData <- do.call(rbind, appData) }
row.names(appData) <- NULL
names(appData) <- c("blank", "Date", "hteam", "ateam", "score")

pJS$stop()   # shut phantomjs down when finished
> head(appData)
blank Date hteam ateam score
1 01.04. 18:00 Ferronikeli Ferizaj 4 : 0
2 01.04. 18:00 Istogu Hajvalia 2 : 1
3 01.04. 18:00 Kosova Vushtrri Trepca Mitrovice 1 : 0
4 01.04. 18:00 Prishtina Drenica 3 : 0
5 31.03. 18:00 Besa Peje Drita 1 : 0
6 31.03. 18:00 Trepca 89 Vellaznimi 2 : 0
> tail(appData)
blank Date hteam ateam score
115 17.08. 22:00 Besa Peje Trepca 89 3 : 3
116 17.08. 22:00 Ferronikeli Hajvalia 2 : 5
117 17.08. 22:00 Trepca Mitrovice Ferizaj 1 : 0
118 17.08. 22:00 Vellaznimi Drenica 2 : 1
119 16.08. 22:00 Kosova Vushtrri Drita 0 : 1
120 16.08. 22:00 Prishtina Istogu 2 : 1
Carry out further formatting as needed.
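For example, here is a minimal sketch of that post-processing, assuming the column names assigned above (note the site only gives day and month, so attaching a season year is left to you):

appData$blank <- NULL   # drop the empty first column

# split "01.04. 18:00" into separate date and time columns
dt <- do.call(rbind, strsplit(appData$Date, " ", fixed = TRUE))
appData$date <- dt[, 1]
appData$time <- dt[, 2]

# split "4 : 0" into numeric home and away goals
sc <- do.call(rbind, strsplit(appData$score, " : ", fixed = TRUE))
appData$hgoals <- as.integer(sc[, 1])
appData$agoals <- as.integer(sc[, 2])

head(appData)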