I want to scrape the match time and date from this url:
http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary
By using the chrome dev tools, I can see this appears to be generated using the following code:
<td colspan="3" id="utime" class="mstat-date">01:20 AM, October 29, 2014</td>
But this is not in the source html.
I think this is because it's JavaScript (correct me if I'm wrong). How can I scrape this information using R?
The most commonly used web scraping package for R is rvest. Install it in your R session with:
install.packages("rvest")
library(rvest)
Knowledge of HTML and CSS is an added advantage, since you will need CSS selectors (like #utime above) to target elements, though many data scientists are not deeply familiar with either.
Python is your best bet. Libraries such as requests or HTTPX makes it very easy to scrape websites that don't require JavaScript to work correctly. Python offers a lot of simple-to-use HTTP clients. And once you get the response, it's also very easy to parse the HTML with BeautifulSoup for example.
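As a minimal sketch of that requests + BeautifulSoup pattern: note that for this particular page the #utime cell is filled in by JavaScript, so a plain HTTP client would not see it; the parsing step below is therefore demonstrated against the rendered markup quoted in the question rather than a live fetch.

```python
from bs4 import BeautifulSoup

# Fetching is one line with an HTTP client (works only for pages that
# don't need JavaScript to render the content you want):
#   import requests
#   html = requests.get(url).text

# Parsing with BeautifulSoup, shown on the rendered <td> from the question:
html = '<td colspan="3" id="utime" class="mstat-date">01:20 AM, October 29, 2014</td>'
soup = BeautifulSoup(html, "html.parser")
match_time = soup.find(id="utime").get_text()
print(match_time)  # 01:20 AM, October 29, 2014
```

The same `find(id=...)` call works unchanged on a full page, as long as the HTML you feed the parser already contains the element.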
So, RSelenium is not the only answer (anymore). If you can install the PhantomJS binary (grab it from http://phantomjs.org/), you can use it to render the HTML and scrape the result with rvest
(similar to the RSelenium approach, but without needing a Java-based Selenium server):
library(rvest)
# render HTML from the site with phantomjs
url <- "http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary"
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
system("phantomjs scrape.js > scrape.html", intern = TRUE)
# extract the content you need
pg <- read_html("scrape.html")
pg %>% html_nodes("#utime") %>% html_text()
## [1] "10:20 AM, October 28, 2014"
You could also use Docker to run the Selenium server (in place of installing Selenium and Java locally).
You will need Docker installed. Then run:
library(RSelenium)
library(rvest)
url <- "http://www.scoreboard.com/game/rosol-l-goffin-d-2014/8drhX07d/#game-summary"
# start a Selenium server with headless Chrome in a container
system('docker run -d -p 4445:4444 selenium/standalone-chrome')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "chrome")
remDr$open()
remDr$navigate(url)
# the browser has already executed the JavaScript, so grab the rendered source
pg <- read_html(remDr$getPageSource()[[1]])
pg %>% html_nodes("#utime") %>% html_text()
# [1] "10:20 AM, October 28, 2014"