Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scraping .asp site with R

I'm scraping http://www.progarchives.com/album.asp?id= and get a warning message:

Warning message:
XML content does not seem to be XML:
http://www.progarchives.com/album.asp?id=2
http://www.progarchives.com/album.asp?id=3 http://www.progarchives.com/album.asp?id=4
http://www.progarchives.com/album.asp?id=5

The scraper works for each page separately but not for the urls b1=2:b2=1000.

 library(RCurl)
 library(XML)

getUrls <- function(b1,b2){
   root="http://www.progarchives.com/album.asp?id="
   urls <- NULL
     for (bandid in b1:b2){
   urls <- c(urls,(paste(root,bandid,sep="")))
  }
  return(urls)
}

prog.arch.scraper <- function(url){
SOURCE <- getUrls(b1=2,b2=1000)
PARSED <- htmlParse(SOURCE)
album <- xpathSApply(PARSED,"//h1[1]",xmlValue)
date <- xpathSApply(PARSED,"//strong[1]",xmlValue)
band <- xpathSApply(PARSED,"//h2[1]",xmlValue)
return(c(band,album,date))
}

prog.arch.scraper(urls)
like image 814
monarque13 Avatar asked Mar 08 '15 23:03

monarque13


1 Answers

Here's an alternate approach with rvest and dplyr:

library(rvest)
library(dplyr)
library(pbapply)

base_url <- "http://www.progarchives.com/album.asp?id=%s"

get_album_info <- function(id) {

  pg <- html(sprintf(base_url, id))
  data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
             date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
             band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
             stringsAsFactors=FALSE)

}

albums <- bind_rows(pblapply(2:10, get_album_info))

head(albums)

## Source: local data frame [6 x 3]
## 
##                        album                           date      band
## 1                    FOXTROT Studio Album, released in 1972   Genesis
## 2              NURSERY CRYME Studio Album, released in 1971   Genesis
## 3               GENESIS LIVE         Live, released in 1973   Genesis
## 4        A TRICK OF THE TAIL Studio Album, released in 1976   Genesis
## 5 FROM GENESIS TO REVELATION Studio Album, released in 1969   Genesis
## 6           GRATUITOUS FLASH Studio Album, released in 1984 Abel Ganz

I didn't feel like barraging the site with a ton of reqs so bump up the sequence for your use. pblapply gives you a free progress bar.

To be kind to the site (esp since it doesn't explicitly prohibit scraping) you might want to throw a Sys.sleep(10) at the end of the get_album_info function.

UPDATE

To handle server errors (in this case 500, but it'll work for others, too), you can use try:

library(rvest)
library(dplyr)
library(pbapply)
library(data.table)

base_url <- "http://www.progarchives.com/album.asp?id=%s"

get_album_info <- function(id) {

  pg <- try(html(sprintf(base_url, id)), silent=TRUE)

  if (inherits(pg, "try-error")) {
    data.frame(album=character(0), date=character(0), band=character(0))
  } else {
    data.frame(album=pg %>% html_nodes(xpath="//h1[1]") %>% html_text(),
               date=pg %>% html_nodes(xpath="//strong[1]") %>% html_text(),
               band=pg %>% html_nodes(xpath="//h2[1]") %>% html_text(),
               stringsAsFactors=FALSE)
  }

}

albums <- rbindlist(pblapply(c(9:10, 23, 28, 29, 30), get_album_info))

##                       album                           date         band
## 1: THE DANGERS OF STRANGERS Studio Album, released in 1988    Abel Ganz
## 2:    THE DEAFENING SILENCE Studio Album, released in 1994    Abel Ganz
## 3:             AD INFINITUM Studio Album, released in 1998 Ad Infinitum

You won't get any entries for the errant pages (in this case it just returns id 9, 10 and 30's entries).

like image 86
hrbrmstr Avatar answered Sep 19 '22 19:09

hrbrmstr