parallel computing using foreach for web scraping

I want to do parallel computing using foreach instead of for loop, but I don't really know how.

So, what I want to do is to get plain texts from a bunch of webpages, and I have over 3000 links to work on. I need to put all of the texts into a single big file.

I know that a for loop would work, but it would take so long that I haven't even bothered to try it.

My question, then, is: how do I convert this for loop into a foreach loop?

Here is my for loop:

library(RCurl)
library(XML)
urls <- scan("file", what = "char", quote = "", sep = "\n")  # a vector containing 3000+ urls
corpus <- character()
for (i in seq_along(urls)) {                     # run the following steps on each link
  html <- getURL(urls[i], followlocation = TRUE)
  doc  <- htmlParse(html, asText = TRUE)
  text <- xpathSApply(doc, "//p", xmlValue)
  corpus <- append(corpus, text)                 # append each page's text
}
asked by charlotte


2 Answers

Here is a version using mclapply:

library(RCurl)
library(XML)
library(parallel)

urls <- c('http://www.google.com', 'http://stackoverflow.com')

corpi <- mclapply(urls, function(url) {
    html <- getURL(url, followlocation = TRUE)
    doc  <- htmlParse(html, asText = TRUE)
    xpathSApply(doc, "//p", xmlValue)
}, mc.cores = 2)

and here with foreach and doMC:

library(RCurl)
library(XML)
library(doMC)
registerDoMC(cores = 2)

urls <- c('http://www.google.com', 'http://stackoverflow.com')

corpi <- foreach(url = urls, .combine = c) %dopar% {
    html <- getURL(url, followlocation = TRUE)
    doc  <- htmlParse(html, asText = TRUE)
    xpathSApply(doc, "//p", xmlValue)   # the last expression is the value returned for each url
}
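Either way, corpi ends up holding the scraped paragraphs of every page. Since the goal is one big file, the results can be flattened and written out afterwards; a minimal sketch (the output file name is just an example):

corpus <- unlist(corpi)            # one character vector with all paragraphs
writeLines(corpus, "corpus.txt")   # write everything into a single text file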

foreach might be a bit easier to use if you are new to the apply functions. Both versions rely on forking (via mclapply and doMC), so they should work on OS X and Linux, but not on Windows.
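If Windows support is needed, a cluster-based backend such as doParallel can be used instead of doMC. A rough sketch, assuming the same packages as above (the cluster size of 2 is just an example):

library(RCurl)
library(XML)
library(doParallel)

cl <- makeCluster(2)          # a PSOCK cluster, which also works on Windows
registerDoParallel(cl)

urls <- c('http://www.google.com', 'http://stackoverflow.com')

corpi <- foreach(url = urls, .combine = c, .packages = c("RCurl", "XML")) %dopar% {
    html <- getURL(url, followlocation = TRUE)
    doc  <- htmlParse(html, asText = TRUE)
    xpathSApply(doc, "//p", xmlValue)
}

stopCluster(cl)

The .packages argument loads RCurl and XML on each worker; this is not needed with forked workers but is required on a socket cluster.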

answered by Paul Staab


The correct approach is to perform concurrent requests. Take a look at ?getURIAsynchronous and http://www.omegahat.org/RCurl/concurrent.html. This approach doesn't always work, but it's worth a try.

library(RCurl)

urls <- c('http://stackoverflow.com/questions/22087072/parallel-computing-using-foreach',
          'http://www.omegahat.org/RCurl/FAQ.html',
          'http://www.omegahat.org/RCurl/RCurlJSS.pdf')

opts <- list(timeout = 1, maxredirs = 2,
             verbose = FALSE, followLocation = TRUE)

ls_uris <- list()
for (i in seq_along(urls)) {
  print(i)                                                     # show progress
  ls_uris[[i]] <- getURIAsynchronous(urls[[i]], .opts = opts)  # fetch one page
}
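Note that the loop above hands getURIAsynchronous a single URL per call, so the downloads still happen one after another. To actually fetch the pages concurrently, the whole vector can be passed in one call; a minimal sketch:

# one call with the whole vector lets RCurl issue the requests concurrently;
# the result holds the page sources, one element per url
txts <- getURIAsynchronous(urls, .opts = opts)

The downloaded pages can then be parsed with htmlParse and xpathSApply as in the other answer.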
answered by marbel


