 

How to optimise scraping with getURL() in R

I am trying to scrape all bills from two pages on the website of the French lower chamber of parliament. The pages cover 2002-2012 and contain fewer than 1,000 bills each.

For this, I scrape with getURL() through this loop:

b <- "http://www.assemblee-nationale.fr" # base
l <- c("12","13") # legislature id

lapply(l, FUN = function(x) {
  print(data <- paste(b, x, "documents/index-dossier.asp", sep = "/"))

  # scrape
  data <- getURL(data); data <- readLines(tc <- textConnection(data)); close(tc)
  data <- unlist(str_extract_all(data, "dossiers/[[:alnum:]_-]+.asp"))
  data <- paste(b, x, data, sep = "/")
  data <- getURL(data)
  write.table(data,file=n <- paste("raw_an",x,".txt",sep="")); str(n)
})

Is there any way to optimise the getURL() function here? I cannot seem to use concurrent downloading by passing the async=TRUE option, which gives me the same error every time:

Error in function (type, msg, asError = TRUE)  : 
Failed to connect to 0.0.0.12: No route to host

Any ideas? Thanks!

asked Apr 09 '12 by Fr.


1 Answer

Try mclapply() from the multicore package instead of lapply().

"mclapply is a parallelized version of lapply, it returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X." (http://www.rforge.net/doc/packages/multicore/mclapply.html)

If that doesn't work, you may get better performance using the XML package. Functions like xmlTreeParse use asynchronous calling.

"Note that xmlTreeParse does allow a hybrid style of processing that allows us to apply handlers to nodes in the tree as they are being converted to R objects. This is a style of event-driven or asynchronous calling." (http://www.inside-r.org/packages/cran/XML/docs/xmlEventParse)

answered Oct 18 '22 by rsoren