Using parallelisation to scrape web pages with R

Tags: r, xml, parallel-processing

I am trying to scrape a large number of web pages to analyse them later. Since the number of URLs is huge, I decided to use the parallel package along with XML.

Specifically, I am using the htmlParse() function from XML, which works fine when used with sapply, but generates empty objects of class HTMLInternalDocument when used with parSapply.

library(parallel)
library(XML)

url1 <- "http://forums.philosophyforums.com/threads/senses-of-truth-63636.html"
url2 <- "http://forums.philosophyforums.com/threads/the-limits-of-my-language-impossibly-mean-the-limits-of-my-world-62183.html"
url3 <- "http://forums.philosophyforums.com/threads/how-language-models-reality-63487.html"

urls<- c(url1,url2,url3)

# Works
output1 <- sapply(urls, function(x) htmlParse(x))
str(output1[[1]])
# Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output1[[1]]

# Doesn't work
myFunction <- function(x) {
  cl <- makeCluster(getOption("cl.cores", detectCores()))
  ok <- parSapply(cl = cl, X = x, FUN = htmlParse)
  stopCluster(cl)
  return(ok)
}

output2 <- myFunction(urls)
str(output2[[1]])
# Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument', 'oldClass' <externalptr>
output2[[1]]
# empty

Thanks.


asked Oct 20 '13 08:10

info_seekeR

1 Answer

You can use getURIAsynchronous from the RCurl package, which lets the caller specify multiple URIs to download at the same time.

library(RCurl)
library(XML)
get.asynch <- function(urls){
  txt <- getURIAsynchronous(urls)
  ## this part can be easily parallelized
  ## I am just using lapply here as a first attempt
  ## (no assignment on the last call, so the result is returned visibly)
  lapply(txt,function(x){
    doc <- htmlParse(x,asText=TRUE)
    xpathSApply(doc,"/html/body/h2[2]",xmlValue)
  })
}

get.synch <- function(urls){
  lapply(urls,function(x){
    doc <- htmlParse(x)
    res2 <- xpathSApply(doc,"/html/body/h2[2]",xmlValue)
    res2
  })}
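
The comment in get.asynch says the parsing step can be parallelized. Here is a minimal sketch of one way to do that with parLapply. The key point is that each worker returns plain character values instead of the parsed document: an HTMLInternalDocument is only an external pointer to memory inside the process that parsed it, so that pointer is no longer valid once the object has been sent back to the master, which is consistent with the empty documents seen in the question. The names extract.values and parse.parallel, the "//h2" XPath and the two-worker default are assumptions made for illustration, not part of the original answer.

```r
library(parallel)
library(XML)

## Hypothetical worker function: parse one HTML string and return only
## plain character values, never the document object itself.
extract.values <- function(x, xpath) {
  doc <- htmlParse(x, asText = TRUE)
  xpathSApply(doc, xpath, xmlValue)
}

## Hypothetical wrapper: distribute the parsing over a small cluster.
parse.parallel <- function(txt, xpath = "//h2", cores = 2) {
  cl <- makeCluster(cores)
  on.exit(stopCluster(cl))                  # always release the workers
  invisible(clusterEvalQ(cl, library(XML))) # each worker needs XML loaded
  parLapply(cl, txt, extract.values, xpath = xpath)
}

## Small offline example with inline HTML instead of downloaded pages.
pages <- c("<html><body><h2>First</h2></body></html>",
           "<html><body><h2>Second</h2><h2>Third</h2></body></html>")
res <- parse.parallel(pages)
```

In practice txt would be the character vector returned by getURIAsynchronous(urls). Whether the extra processes pay off depends on how expensive the parsing is compared with the cost of starting the cluster.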

Here is a benchmark on 100 URLs. The timing covers both download and parsing, and the asynchronous version takes a little over half as long (about 22.5 s versus 39.5 s).

library(microbenchmark)
uris = c("http://www.omegahat.org/RCurl/index.html")
urls <- replicate(100,uris)
microbenchmark(get.asynch(urls),get.synch(urls),times=1)

Unit: seconds
             expr      min       lq   median       uq      max neval
 get.asynch(urls) 22.53783 22.53783 22.53783 22.53783 22.53783     1
  get.synch(urls) 39.50615 39.50615 39.50615 39.50615 39.50615     1
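
If the whole parsed document is needed afterwards, as in the question, rather than a few extracted values, one option is to convert the document to text before it leaves the process that parsed it and to rebuild it where it is used. The sketch below assumes this round trip through saveXML is acceptable for the pages in question. The helper name parse.to.text is hypothetical, and the example uses an inline HTML string so that it runs without network access.

```r
library(XML)

## Hypothetical helper: parse, then serialise the document to a character
## string, which can safely be passed between R processes.
parse.to.text <- function(x, asText = FALSE) {
  doc <- htmlParse(x, asText = asText)
  saveXML(doc)
}

html <- "<html><body><h2>Title</h2></body></html>"
txt  <- parse.to.text(html, asText = TRUE)  # plain character string
doc  <- htmlParse(txt, asText = TRUE)       # rebuilt in the current session
val  <- xpathSApply(doc, "//h2", xmlValue)
```

With a cluster, the same idea would be to run something like parSapply(cl, urls, parse.to.text) after loading XML on the workers with clusterEvalQ, and then call htmlParse(..., asText=TRUE) on each returned string in the master session.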

answered Oct 07 '22 19:10

agstudy
