I want to do parallel computing using foreach instead of a for loop, but I don't really know how.
What I want to do is get the plain text from a bunch of webpages. I have over 3000 links to work through, and I need to put all of the text into a single big file.
I know a for loop would work, but it would take so long that I don't even want to try it.
My question, then, is: how do I convert the for loop into foreach?
Here is my for loop:
library(RCurl)
library(XML)

urls <- scan("file", what = "char", quote = "", sep = "\n")  # a vector containing 3000+ URLs
corpus <- character()
for (i in seq_along(urls)) {  # apply the same steps to each link
  html <- getURL(urls[i], followlocation = TRUE)
  doc <- htmlParse(html, asText = TRUE)
  text <- xpathSApply(doc, "//p", xmlValue)  # extract the text of every <p> node
  corpus <- append(corpus, text)  # append each page's text to the corpus
}
Here is a version using mclapply:
library(RCurl)
library(XML)
library(parallel)

urls <- c('http://www.google.com', 'http://stackoverflow.com')
corpi <- mclapply(urls, function(url) {
  html <- getURL(url, followlocation = TRUE)
  doc <- htmlParse(html, asText = TRUE)
  xpathSApply(doc, "//p", xmlValue)  # one character vector of <p> text per URL
}, mc.cores = 2)
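mclapply returns a list with one element per URL, so to get the single big file the question asks for you can flatten the list and write it out. A minimal sketch (the filename corpus.txt is just an example):

corpus <- unlist(corpi)           # flatten the per-URL results into one character vector
writeLines(corpus, "corpus.txt")  # write the combined text to a single file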
and here with foreach and doMC:
library(RCurl)
library(XML)
library(doMC)

registerDoMC(cores = 2)

urls <- c('http://www.google.com', 'http://stackoverflow.com')
corpi <- foreach(url = urls, .combine = c) %dopar% {
  html <- getURL(url, followlocation = TRUE)
  doc <- htmlParse(html, asText = TRUE)
  xpathSApply(doc, "//p", xmlValue)  # the last expression is the value foreach collects; don't use return() here
}
foreach might be a bit easier to use if you are new to the apply functions. Both versions rely on forking, so they should work on OS X and Linux, but not on Windows.
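If you do need Windows, a socket cluster via doParallel should work instead of doMC. A sketch, assuming the doParallel package is installed (socket workers are separate R processes, so .packages must load RCurl and XML on each of them):

library(RCurl)
library(XML)
library(doParallel)

cl <- makeCluster(2)  # socket cluster; works on Windows as well
registerDoParallel(cl)

urls <- c('http://www.google.com', 'http://stackoverflow.com')
corpi <- foreach(url = urls, .combine = c,
                 .packages = c("RCurl", "XML")) %dopar% {
  html <- getURL(url, followlocation = TRUE)
  doc <- htmlParse(html, asText = TRUE)
  xpathSApply(doc, "//p", xmlValue)
}
stopCluster(cl)  # shut the workers down when done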
A lighter-weight approach is to perform the requests concurrently within a single R process. Take a look at ?getURIAsynchronous and http://www.omegahat.org/RCurl/concurrent.html. This approach doesn't always work, but it's worth a try.
library(RCurl)

urls <- c('http://stackoverflow.com/questions/22087072/parallel-computing-using-foreach',
          'http://www.omegahat.org/RCurl/FAQ.html',
          'http://www.omegahat.org/RCurl/RCurlJSS.pdf')

opts <- list(timeout = 1, maxredirs = 2,
             verbose = FALSE, followLocation = TRUE)

# pass the whole vector in one call: getURIAsynchronous downloads the URLs
# concurrently and returns a named character vector of page contents
txt <- getURIAsynchronous(urls, .opts = opts)
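The result is a named character vector of page contents, so the same XPath step from the question can turn it into the corpus and the single big file. A sketch, assuming failed or timed-out downloads come back as empty strings:

library(XML)

ok <- nchar(txt) > 0  # drop empty downloads before parsing
corpus <- unlist(lapply(txt[ok], function(html) {
  doc <- htmlParse(html, asText = TRUE)
  xpathSApply(doc, "//p", xmlValue)
}))
writeLines(corpus, "corpus.txt")  # the single big file the question asks for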