
How to scrape this Squawka page?

I am trying to extract the following information:

On the page

http://epl.squawka.com/stoke-city-vs-arsenal/01-03-2014/english-barclays-premier-league/matches

pressing the red "full stats" button opens a menu that includes, on the left-hand side, the button "Crosses". This opens, on the right side of the screen, an image of a soccer pitch with 19 arrows on it. These are the cross passes by Stoke in the Stoke-Arsenal match. They are color coded: red = not completed, green = completed, yellow = key passes. When you click on an arrow, it tells you who gave the pass and in what minute of the game. The arrows also show where the player was standing when he gave the pass and where the receiving player was.

I would like to be able to scrape this page such that I get a table with the columns:

team; name-of-sender; location-of-sender; location-of-receiver; minute; color-of-arrow

This is the set of cross passes made by Stoke. I would also like to automatically repeat this for Arsenal (hence the column "team" in the table above).

Although I have scraped webpages in the past, these have all been static, fairly straightforward pages, and I am totally dumbfounded as to how to scrape the info from this page. I would really appreciate help with scraping the data I just described. I am well-versed in R, so I would especially appreciate code that achieves this in R, but I would also appreciate help that uses another language or other software.

Thank you, Peter

Peter Verbeet asked Mar 01 '14 22:03


1 Answer

Peter, as the guys indicated, you can do this with Selenium. I also like to use the excellent selectr package. The idea is to briefly interact with the site and then do the rest elsewhere. squawkData should contain everything needed.

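# RSelenium::startServer() # if needed
library(RSelenium)
library(XML)      # xmlParse(), xmlValue(), xmlAttrs()
library(selectr)  # querySelectorAll()

remDr <- remoteDriver()
remDr$open()
remDr$setImplicitWaitTimeout(3000)
remDr$navigate("http://epl.squawka.com/stoke-city-vs-arsenal/01-03-2014/english-barclays-premier-league/matches")
# Serialize the page's own XML data object (squawkaDp.xml) to a string
squawkData <- remDr$executeScript("return new XMLSerializer().serializeToString(squawkaDp.xml);", list())
# Parse the string, then select every time_slice under the crosses element
doc <- xmlParse(squawkData[[1]])
example <- querySelectorAll(doc, "crosses time_slice")
example[[1]]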


<time_slice name="0 - 5" id="1">
  <event player_id="531" mins="4" secs="39" minsec="279" team="44" type="Failed">
    <start>73.1,87.1</start>
    <end>97.9,49.1</end>
  </event>
</time_slice> 

DISCLAIMER: I am the author of the RSelenium package. Basic vignettes on its operation are available: "RSelenium basics" and "RSelenium: Testing Shiny apps".

Further info can be accessed easily using selectr:

> xmlValue(querySelectorAll(xmlParse(squawkData[[1]]), "players #531 name")[[1]])
[1] "Charlie Adam"

> xmlValue(querySelectorAll(xmlParse(squawkData[[1]]), "game team#44 long_name")[[1]])
[1] "Stoke City"
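
When many ids need translating, repeating the selector query for each one gets tedious. Here is a minimal base-R sketch, assuming the id/name pairs have already been collected into named character vectors. The vectors and the lookup_name helper are hypothetical, and the vectors hold only the two pairs confirmed above.

```r
# Lookup vectors keyed by id; only the two pairs shown above are known here.
player_names <- c("531" = "Charlie Adam")
team_names   <- c("44"  = "Stoke City")

# Hypothetical helper: translate ids to names, giving NA for unknown ids.
lookup_name <- function(ids, lookup) {
  unname(lookup[as.character(ids)])
}

lookup_name(c(531, 311), player_names)
lookup_name(c("44", "31"), team_names)
```

Ids that are missing from the lookup come back as NA, which makes gaps easy to spot before the names are joined onto the events.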

UPDATE:
To process example into a data frame, you can do something like:

out <- lapply(example, function(x) {
  # each time_slice may hold zero or more event nodes
  if (length(x['event']) > 0) {
    res <- lapply(x['event'], function(y) {
      # attributes (player_id, mins, ...) plus the start/end coordinates
      matchAttrs <- as.list(xmlAttrs(y))
      matchAttrs$start <- xmlValue(y['start']$start)
      matchAttrs$end <- xmlValue(y['end']$end)
      matchAttrs
    })
    return(do.call(rbind.data.frame, res))
  }
})

> head(do.call(rbind, out))
        player_id mins secs minsec team   type     start       end
event         531    4   39    279   44 Failed 73.1,87.1 97.9,49.1
event5        311    6   33    393   31 Failed 92.3,13.1 93.0,31.0
event1        376    8   57    537   31 Failed  97.7,6.1 96.7,16.4
event6        311   13   50    830   31 Failed  99.5,0.5 94.9,42.6
event11       311   14   11    851   31 Failed  99.5,0.5 93.1,51.0
event7        311   17   41   1061   31 Failed 99.5,99.5 92.6,50.1
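
To get closer to the table asked for in the question, the "x,y" strings can be split into numeric columns and the type mapped to an arrow colour. This is a sketch under assumptions, not a tested pipeline: split_xy and arrow_colour are hypothetical names, the sample rows are copied from the output above, and only "Failed" is mapped (to red, i.e. not completed) because no other type values appear in that output.

```r
# Three rows copied from the output above, all columns as character.
crosses <- data.frame(
  player_id = c("531", "311", "376"),
  mins      = c("4", "6", "8"),
  team      = c("44", "31", "31"),
  type      = c("Failed", "Failed", "Failed"),
  start     = c("73.1,87.1", "92.3,13.1", "97.7,6.1"),
  end       = c("97.9,49.1", "93.0,31.0", "96.7,16.4"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn "x,y" strings into a numeric matrix with columns x, y.
split_xy <- function(xy) {
  parts <- strsplit(as.character(xy), ",", fixed = TRUE)
  m <- matrix(as.numeric(unlist(parts)), ncol = 2, byrow = TRUE)
  colnames(m) <- c("x", "y")
  m
}

start_xy <- split_xy(crosses$start)
end_xy   <- split_xy(crosses$end)

# Assumption: "Failed" is the red (not completed) arrow. Other type values
# map to NA until you have inspected unique(crosses$type) on the real data.
arrow_colour <- c(Failed = "red")

passes <- data.frame(
  team      = crosses$team,
  player_id = crosses$player_id,
  start_x   = start_xy[, "x"],
  start_y   = start_xy[, "y"],
  end_x     = end_xy[, "x"],
  end_y     = end_xy[, "y"],
  # as.character() first, in case the columns arrive as factors
  minute    = as.integer(as.character(crosses$mins)),
  colour    = unname(arrow_colour[as.character(crosses$type)]),
  stringsAsFactors = FALSE
)
passes
```

The same steps should apply to the full do.call(rbind, out) result. The team and player_id columns can then be replaced by names using the selectr queries shown earlier.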
jdharrison answered Oct 22 '22 04:10