
How to scrape the web for the list of R release dates?

Tags:

r

To celebrate the 20,000th question with the r-tag on Stack Overflow, please help me to extract the R release dates from the Wikipedia page.

My attempts:

    library(XML)
    x <- readHTMLTable("http://en.wikipedia.org/wiki/R_(programming_language)")

This doesn't work because the table is in fact a list, not an HTML table.

    library(httr)
    x <- GET("http://en.wikipedia.org/wiki/R_(programming_language)")
    text <- content(x, "parsed")

This extracts the text, but my xpath is rusty, so I couldn't extract the relevant release dates.
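To illustrate, a minimal XPath sketch of the approach (assuming the release history is rendered as plain list items, which may not match the page's actual markup):

```r
library(XML)

# Parse the page; readHTMLTable() is no help here because the
# release history is a list, not a <table>
doc <- htmlParse("http://en.wikipedia.org/wiki/R_(programming_language)")

# Pull the text of every list item and keep the ones that look like
# version strings -- the pattern is a guess at the page's wording
items <- xpathSApply(doc, "//li", xmlValue)
grep("^R\\s+[0-9]+\\.[0-9]+", items, value = TRUE)
```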

How can I do this?


PS. The Wikipedia page is the only source I could find, but please feel free to post a solution using a canonical source, if there is one.

Asked Nov 26 '12 by Andrie



2 Answers

Why don't you use the file dates on the canonical ftp archive in Vienna?

Edit: Eg

 lynx -dump http://cran.r-project.org/src/base/R-0/ | grep tgz | grep -v http 

gives you a table you can parse from R, with file sizes as a bonus. Rinse and repeat for the R-1 and R-2 directories.
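The same idea can be sketched without shelling out to lynx, assuming the CRAN directory index is served as a plain HTML listing:

```r
# Fetch the directory index for the R-0.x sources; the same works
# for .../R-1/ and .../R-2/
listing <- readLines("http://cran.r-project.org/src/base/R-0/")

# Keep lines mentioning a tarball, then drop link-only markup lines,
# mirroring the `grep tgz | grep -v http` pipeline above
tarballs <- grep("tgz", listing, value = TRUE)
tarballs <- grep("http", tarballs, invert = TRUE, value = TRUE)
```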

Answered Sep 23 '22 by Dirk Eddelbuettel


Edited to include R version 3.0.0 and above

Dirk Eddelbuettel provided the canonical link to the .0 releases of R.

Here is some code that collates the tables from the four separate URLs, one for each major release, and then plots the result:

    library(XML)
    library(lattice)

    getRdates <- function(){
      url <- paste0("http://cran.r-project.org/src/base/R-", 0:3)
      x <- lapply(url, function(x) readHTMLTable(x, stringsAsFactors=FALSE)[[1]])
      x <- do.call(rbind, x)
      x <- x[grep("R-(.*)(\\.tar\\.gz|\\.tgz)", x$Name), c(-1, -5)]
      x$Release <- gsub("(R-.*)\\.(tar\\.gz|tgz)", "\\1", x$Name)
      x$Date <- as.POSIXct(x[["Last modified"]], format="%d-%b-%Y %H:%M")
      x$Release <- reorder(x$Release, x$Date)
      x
    }

    x <- getRdates()
    dotplot(Release~Date, data=x)

[Plot: dotplot of R releases by date, one row per release]

Answered Sep 23 '22 by Andrie