Using R to scrape the link address of a downloadable file from a web page?

I'm trying to automate a process that involves downloading .zip files from a couple of web pages and extracting the .csvs they contain. The challenge is that the .zip file names, and thus the link addresses, change weekly or annually, depending on the page. Is there a way to scrape the current link addresses from those pages so I can then feed those addresses to a function that downloads the files?

One of the target pages is http://www.acleddata.com/data/realtime-data-2015/. The file I want to download is the second bullet under the header "2015 Realtime Complete All Africa File"---i.e., the zipped .csv. As I write, that file is labeled "Realtime 2015 All Africa File (updated 11th July 2015)(csv)" on the web page, and the link address that I want is http://www.acleddata.com/wp-content/uploads/2015/07/ACLED-All-Africa-File_20150101-to-20150711_csv.zip, but that should change later today, because the data are updated each Monday---hence my challenge.

I tried but failed to automate extraction of that .zip file name with 'rvest' and the SelectorGadget extension in Chrome. Here's how that went:

> library(rvest)
> realtime.page <- "http://www.acleddata.com/data/realtime-data-2015/"
> realtime.html <- html(realtime.page)
> realtime.link <- html_node(realtime.html, xpath = "//ul[(((count(preceding-sibling::*) + 1) = 7) and parent::*)]//li+//li//a")
> realtime.link
[1] NA

The XPath in that call to html_node() came from SelectorGadget: I highlighted just the (csv) portion of the "Realtime 2015 All Africa File (updated 11th July 2015)(csv)" field in green, then clicked on enough other highlighted bits of the page to eliminate all the yellow and leave only red and green.

Did I make a small mistake in that process, or am I just entirely on the wrong track here? As you can tell, I have zero experience with HTML and web-scraping, so I'd really appreciate some assistance.

asked Jul 20 '15 by ulfelder

1 Answer

I think you're trying to do too much in a single XPath expression - I'd attack the problem in a sequence of smaller steps:

library(rvest)
library(stringr)
page <- read_html("http://www.acleddata.com/data/realtime-data-2015/")

page %>%
  html_nodes("a") %>%       # find all links
  html_attr("href") %>%     # get the url
  str_subset("\\.zip") %>%  # keep the links pointing to .zip files
  .[[1]]                    # look at the first one
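
From there the scraped address can be fed straight to download.file() and unzip(). Here's a minimal follow-on sketch (it assumes the first .zip link on the page is the one you want and that the archive holds a single .csv; the temp-file paths are just placeholders):

zip_url <- page %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("\\.zip") %>%
  .[[1]]

tmp <- tempfile(fileext = ".zip")
download.file(zip_url, tmp, mode = "wb")      # "wb" so the archive isn't mangled on Windows
csv_file <- unzip(tmp, list = TRUE)$Name[1]   # name of the csv inside the archive
unzip(tmp, files = csv_file, exdir = tempdir())
acled <- read.csv(file.path(tempdir(), csv_file), stringsAsFactors = FALSE)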
answered Oct 16 '22 by hadley