
RCurl does not retrieve the full source text of a website - links missing?

I would like to use RCurl as a polite webcrawler to download data from a website. I need the data for scientific research; although I have the right to access the site's content via my university, the site's terms of use forbid webcrawlers.

I asked the site's administrators directly for the data, but they only replied vaguely; in any case, it seems they won't simply send me the underlying databases.

What I want to do now is officially ask them for one-time permission to download specific text-only content from their site using an R script based on RCurl that waits three seconds after each request.

The addresses of the pages I want to download data from look like this: http://plants.jstor.org/specimen/ID of the site

I tried to program it with RCurl, but I cannot get it to work. A few things complicate matters:

  1. One can only access the website if cookies are allowed (I got that working in RCurl with the cookiefile argument; see the sketch after this list).

  2. The Next button only appears in the source code when one has actually reached the page by clicking through the links in a normal browser. In the source code the Next button is encoded by an expression including

    <a href="/.../***ID of next site***">Next &gt; &gt; </a>
    

    When one tries to access a page directly (without having clicked through to it in the same browser before), it won't work: the line with the link is simply missing from the source code.

  3. The IDs of the pages are combinations of letters and digits (like “goe0003746” or “cord00002203”), so I can't simply write a for loop in R that tries every number from 1 to 1,000,000.
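
Regarding point 1, here is a minimal sketch of how I set up the cookie handling. The file name "cookies.txt" is just a placeholder, and the specimen ID is one of the examples from point 3:

    library(RCurl)

    ## minimal sketch of the cookie handling from point 1;
    ## "cookies.txt" is only a placeholder file name
    curl <- getCurlHandle(cookiefile     = "cookies.txt",   # read stored cookies from this file
                          cookiejar      = "cookies.txt",   # write received cookies back to it
                          followlocation = TRUE)            # follow redirects
    page <- getURL("http://plants.jstor.org/specimen/goe0003746", curl = curl)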

So my program is supposed to mimic a person who clicks through all the pages via the Next button, saving the textual content each time.

After saving the content of each page, it should wait three seconds before clicking the Next button (it must be a polite crawler). I got that working in R as well using the Sys.sleep function; a rough sketch of the whole loop follows below.
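
To make the idea concrete, here is a rough sketch of the loop I have in mind. It is only a sketch: the starting ID is one of the examples from point 3, and the XPath used to find the Next link is an assumption about the page's markup that I have not been able to verify (see point 2):

    library(RCurl)
    library(XML)

    curl <- getCurlHandle(cookiefile = "cookies.txt", followlocation = TRUE)
    url  <- "http://plants.jstor.org/specimen/goe0003746"    # assumed starting page

    for (i in 1:10) {                                        # small test run of ten pages
      page <- getURL(url, curl = curl)
      write(page, file = paste("specimen_", i, ".html", sep = ""))

      ## look for the Next link; the XPath is an assumption about the markup
      doc <- htmlParse(page, asText = TRUE)
      nxt <- xpathSApply(doc, "//a[contains(., 'Next')]", xmlGetAttr, "href")
      free(doc)
      if (length(nxt) == 0) break                            # no Next link -> stop

      url <- paste("http://plants.jstor.org", nxt[1], sep = "")
      Sys.sleep(3)                                           # polite three-second delay
    }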

I also thought of using an automated program, but there seem to be a lot of such programs and I don’t know which one to use.

I’m also not exactly a programmer (apart from a little bit of R), so I would really appreciate a solution that doesn’t involve programming in Python, C++, PHP or the like.

Any thoughts would be much appreciated! Thank you very much in advance for your comments and proposals!


user1012744


1 Answer

Try a different strategy: instead of following the Next button, gather the specimen links from the search page http://plants.jstor.org/search?t=2076 and then follow those links one by one.

 ##########################
 ####
 ####            Scrape http://plants.jstor.org/specimen/
 ####        Idea:: Gather links from http://plants.jstor.org/search?t=2076
 ####            Then follow links:
 ####
 #########################

 library(RCurl)
 library(XML)

 ### get search page::

 cookie = 'cookiefile.txt'
 curl <- getCurlHandle(cookiefile = cookie,
     useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6",
     header = FALSE,
     verbose = TRUE,
     netrc = TRUE,
     maxredirs = as.integer(20),
     followlocation = TRUE)

 querry.jstor <- getURL('http://plants.jstor.org/search?t=2076', curl = curl)

 ## remove white spaces:
 querry.jstor2 <- gsub('\r','', gsub('\t','', gsub('\n','', querry.jstor)))

 ### get links from search page
  getLinks = function() {
        links = character()
        list(a = function(node, ...) {
                    links <<- c(links, xmlGetAttr(node, "href"))
                    node
                 },
             links = function()links)
      }

 ## retrieve links: create a handler instance and parse the page with it
  h1 <- getLinks()
  querry.jstor.xml.parsed <- htmlTreeParse(querry.jstor2, asText = TRUE, handlers = h1)

 ## clean up the links to keep only the ones we want
  querry.jstor.links <- querry.jstor.xml.parsed$links()
  querry.jstor.links <- querry.jstor.links[!grepl('http', querry.jstor.links)]       ## drop absolute http links
  querry.jstor.links <- querry.jstor.links[!grepl('search', querry.jstor.links)]     ## drop search links
  querry.jstor.links <- querry.jstor.links[!grepl('#', querry.jstor.links)]          ## drop anchor (#) links
  querry.jstor.links <- querry.jstor.links[!grepl('javascript', querry.jstor.links)] ## drop javascript links
  querry.jstor.links <- querry.jstor.links[!grepl('action', querry.jstor.links)]     ## drop action links
  querry.jstor.links <- querry.jstor.links[!grepl('page', querry.jstor.links)]       ## drop pagination links

 ## number of results (the count precedes the word "Results" in the first article node)
  jstor.article <- getNodeSet(htmlParse(querry.jstor2, asText = TRUE), "//article")
  NumOfRes <- strsplit(gsub(',', '', gsub(' ', '', xmlValue(jstor.article[[1]][[1]]))), split = '')[[1]]
  NumOfRes <- as.numeric(paste(NumOfRes[1:(min(grep('R', NumOfRes)) - 1)], collapse = ''))

  for(i in 2:ceiling(NumOfRes/20)){    ## the search shows 20 results per page
    querry.jstor <- getURL(paste('http://plants.jstor.org/search?t=2076&p=', i, sep = ''), curl = curl)
    ## remove white space:
    querry.jstor2 <- gsub('\r', '', gsub('\t', '', gsub('\n', '', querry.jstor)))
    h1 <- getLinks()    ## fresh handler for each results page
    querry.jstor.xml.parsed <- htmlTreeParse(querry.jstor2, asText = TRUE, handlers = h1)
    querry.jstor.links <- c(querry.jstor.links, querry.jstor.xml.parsed$links())
    querry.jstor.links <- querry.jstor.links[!grepl('http', querry.jstor.links)]       ## drop absolute http links
    querry.jstor.links <- querry.jstor.links[!grepl('search', querry.jstor.links)]     ## drop search links
    querry.jstor.links <- querry.jstor.links[!grepl('#', querry.jstor.links)]          ## drop anchor (#) links
    querry.jstor.links <- querry.jstor.links[!grepl('javascript', querry.jstor.links)] ## drop javascript links
    querry.jstor.links <- querry.jstor.links[!grepl('action', querry.jstor.links)]     ## drop action links
    querry.jstor.links <- querry.jstor.links[!grepl('page', querry.jstor.links)]       ## drop pagination links

    Sys.sleep(abs(rnorm(1, mean=3.0, sd=0.5))) 
  }

  ## make directory for saving data: 
  dir.create('./jstorQuery/')

  ## now that we have all the links, retrieve the individual pages
  for(j in 1:length(querry.jstor.links)){
    if(nchar(querry.jstor.links[j]) != 1){    ## skip trivial links such as "/"
       querry.jstor <- getURL(paste('http://plants.jstor.org', querry.jstor.links[j], sep = ''), curl = curl)
       ## remove white space:
       querry.jstor2 <- gsub('\r', '', gsub('\t', '', gsub('\n', '', querry.jstor)))

       ## construct a file name from the part of the link after the last '/'
       filename <- basename(querry.jstor.links[j])

       ## save in the directory:
       write(querry.jstor2, file = paste('./jstorQuery/', filename, '.html', sep = ''))

       Sys.sleep(abs(rnorm(1, mean = 3.0, sd = 0.5)))    ## polite pause of about 3 seconds
    }
  }

Mischa Vreeburg