I am trying to extract the following information:
On the page
http://epl.squawka.com/stoke-city-vs-arsenal/01-03-2014/english-barclays-premier-league/matches
pressing the red "full stats" button opens a menu that includes, on the left-hand side, the button "Crosses". Clicking it shows, on the right side of the screen, an image of a soccer pitch with 19 arrows on it; these are the crosses made by Stoke in the Stoke-Arsenal match. The arrows are color coded: red = not completed, green = completed, yellow = key pass. Clicking an arrow shows who made the pass and in which minute of the game. The arrows also show where the passer was standing and where the intended receiver was.
I would like to be able to scrape this page such that I get a table with the columns:
team; name-of-sender; location-of-sender; location-of-receiver; minute; color-of-arrow
This is the set of crosses made by Stoke. I would also like to repeat this automatically for Arsenal (hence the column "team" in the table above).
Although I have scraped webpages in the past, they were all fairly straightforward static pages, and I am at a loss as to how to scrape the information from this one. I would really appreciate help with scraping the data I just described. I am well versed in R, so I would especially appreciate R code, but I would also welcome solutions that use another language or other software.
Thank you, Peter
Peter, as the others indicated, you can do this with Selenium. I also like to use the excellent selectr package. The idea is to interact with the site briefly, then do the rest elsewhere. squawkData should contain everything needed.
# RSelenium::startServer() # if needed
require(RSelenium)
remDr <- remoteDriver()
remDr$open()
remDr$setImplicitWaitTimeout(3000)
remDr$navigate("http://epl.squawka.com/stoke-city-vs-arsenal/01-03-2014/english-barclays-premier-league/matches")
# squawkaDp.xml is an XML document held by the page's JavaScript; serialize it to a string
squawkData <- remDr$executeScript("return new XMLSerializer().serializeToString(squawkaDp.xml);", list())
require(selectr)
require(XML) # provides xmlParse(), xmlValue() and xmlAttrs() used below
example <- querySelectorAll(xmlParse(squawkData[[1]]), "crosses time_slice")
example[[1]]
<time_slice name="0 - 5" id="1">
<event player_id="531" mins="4" secs="39" minsec="279" team="44" type="Failed">
<start>73.1,87.1</start>
<end>97.9,49.1</end>
</event>
</time_slice>
DISCLAIMER: I am the author of the RSelenium package. Basic vignettes on its operation are available: "RSelenium basics" and "RSelenium: Testing Shiny apps".
Further info can be accessed easily using selectr:
> xmlValue(querySelectorAll(xmlParse(squawkData[[1]]), "players #531 name")[[1]])
[1] "Charlie Adam"
> xmlValue(querySelectorAll(xmlParse(squawkData[[1]]), "game team#44 long_name")[[1]])
[1] "Stoke City"
UPDATE:
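To process example into a data frame, you can do something like:
out <- lapply(example, function(x) {
  # x is one <time_slice>; x["event"] lists its <event> children (possibly none)
  events <- x["event"]
  if (length(events) > 0) {
    res <- lapply(events, function(y) {
      # start from the event's attributes, then add the start/end coordinates
      matchAttrs <- as.list(xmlAttrs(y))
      matchAttrs$start <- xmlValue(y["start"]$start)
      matchAttrs$end <- xmlValue(y["end"]$end)
      matchAttrs
    })
    return(do.call(rbind.data.frame, res))
  }
  # time slices without events yield NULL, which rbind() drops below
})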
> head(do.call(rbind, out))
        player_id mins secs minsec team   type     start       end
event         531    4   39    279   44 Failed 73.1,87.1 97.9,49.1
event5        311    6   33    393   31 Failed 92.3,13.1 93.0,31.0
event1        376    8   57    537   31 Failed  97.7,6.1 96.7,16.4
event6        311   13   50    830   31 Failed  99.5,0.5 94.9,42.6
event11       311   14   11    851   31 Failed  99.5,0.5 93.1,51.0
event7        311   17   41   1061   31 Failed 99.5,99.5 92.6,50.1
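To get closer to the table asked for in the question, the "x,y" strings in start and end can be split into numeric columns and the ids replaced by names. The following is only a sketch in base R: the crosses data frame is a hand-typed copy of the first three rows printed above, split_xy is a hypothetical helper, and the two lookup vectors contain only the names confirmed earlier in this answer (player 531 and team 44); in practice they would be built from the players and team nodes of squawkData.

```r
# Hand-typed copy of the first rows printed above (assumption: character columns).
crosses <- data.frame(
  player_id = c("531", "311", "376"),
  mins      = c("4", "6", "8"),
  team      = c("44", "31", "31"),
  type      = c("Failed", "Failed", "Failed"),
  start     = c("73.1,87.1", "92.3,13.1", "97.7,6.1"),
  end       = c("97.9,49.1", "93.0,31.0", "96.7,16.4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split "x,y" strings into a two-column numeric matrix.
split_xy <- function(xy) {
  parts <- strsplit(as.character(xy), ",", fixed = TRUE)
  matrix(as.numeric(unlist(parts)), ncol = 2, byrow = TRUE)
}

start_xy <- split_xy(crosses$start)
end_xy   <- split_xy(crosses$end)
crosses$start_x <- start_xy[, 1]
crosses$start_y <- start_xy[, 2]
crosses$end_x   <- end_xy[, 1]
crosses$end_y   <- end_xy[, 2]
crosses$mins    <- as.integer(crosses$mins)

# Lookup vectors keyed by id; only the entries shown in this answer are filled in.
player_names <- c("531" = "Charlie Adam")
team_names   <- c("44" = "Stoke City")

# Ids missing from a lookup vector come back as NA rather than raising an error.
crosses$name      <- unname(player_names[crosses$player_id])
crosses$team_name <- unname(team_names[crosses$team])

print(crosses[, c("team_name", "name", "start_x", "start_y",
                  "end_x", "end_y", "mins", "type")])
```

Indexing a named vector keeps the row order intact, which merge() would not guarantee.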