
Scraping javascript website in R

I want to scrape the match time and date from this url:

http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary

By using the chrome dev tools, I can see this appears to be generated using the following code:

<td colspan="3" id="utime" class="mstat-date">01:20 AM, October 29, 2014</td>

But this is not in the source html.

I think this is because the content is rendered with JavaScript (correct me if I'm wrong). How can I scrape this information using R?

asked Oct 29 '14 by Liam Flynn



2 Answers

So, RSelenium is no longer the only answer. If you can install the PhantomJS binary (grab it from http://phantomjs.org/), you can use it to render the HTML and scrape the result with rvest (similar to the RSelenium approach, but without requiring Java):

library(rvest)

# render HTML from the site with phantomjs

url <- "http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary"

writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
    console.log(page.content); //page source
    phantom.exit();
});", url), con="scrape.js")

system("phantomjs scrape.js > scrape.html", intern = TRUE)

# extract the content you need
pg <- read_html("scrape.html")
pg %>% html_nodes("#utime") %>% html_text()

## [1] "10:20 AM, October 28, 2014"
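Once you have the text, you may want an actual date-time rather than a string. A minimal sketch in base R (assuming an English locale, since `%p` and `%B` match locale-specific AM/PM and month names):

```r
# parse the scraped string into a POSIXct date-time (base R, no extra packages)
ts <- "10:20 AM, October 28, 2014"
dt <- as.POSIXct(ts, format = "%I:%M %p, %B %d, %Y", tz = "UTC")
format(dt, "%Y-%m-%d %H:%M")
## [1] "2014-10-28 10:20"
```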
answered by hrbrmstr


You could also use Docker to run the Selenium server (in place of installing Selenium and Java directly). You will still need Docker installed. Then run:

library(RSelenium)
library(rvest)

url <- "http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary"

system('docker run -d -p 4445:4444 selenium/standalone-chrome')
Sys.sleep(5) # give the container a moment to start

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "chrome")
remDr$open()
remDr$navigate(url)

# extract the content you need from the page the driver rendered
pg <- read_html(remDr$getPageSource()[[1]])
pg %>% html_nodes("#utime") %>% html_text()

# [1] "10:20 AM, October 28, 2014"
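When you're done, it's tidy to close the browser session and stop the container. A hedged sketch (the `docker ps --filter` query assumes the container was started from the `selenium/standalone-chrome` image as above, and that it is the only one):

```r
remDr$close()  # end the remote browser session
# stop the container started earlier (assumes only one from this image is running)
system('docker stop $(docker ps -q --filter ancestor=selenium/standalone-chrome)')
```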
answered by stevec