
R WebCrawler - XML content does not seem to be XML

Tags: r, xml, statistics

I took the following code from the rNomads package and modified it a little bit.

When I initially run it, I get:

> WebCrawler(url = "www.bikeforums.net")
[1] "www.bikeforums.net"
[1] "www.bikeforums.net"

Warning message:
XML content does not seem to be XML: 'www.bikeforums.net' 

Here is the code:

require("XML")

# cleaning workspace
rm(list = ls())

# This function recursively searches for links in the given url and follows every single link.
# It returns a list of the final (dead end) URLs.
# depth - How many links to return. This avoids having to recursively scan hundreds of links. Defaults to NULL, which returns everything.
WebCrawler <- function(url, depth = NULL, verbose = TRUE) {

  doc <- XML::htmlParse(url)
  links <- XML::xpathSApply(doc, "//a/@href")
  XML::free(doc)
  if(is.null(links)) {
    if(verbose) {
      print(url)
    }
    return(url)
  } else {
    urls.out <- vector("list", length = length(links))
    for(link in links) {
      if(!is.null(depth)) {
        if(length(unlist(urls.out)) >= depth) {
          break
        }
      }
      urls.out[[link]] <- WebCrawler(link, depth = depth, verbose = verbose)
    }
    return(urls.out)
  }
}


# Execution
WebCrawler(url = "www.bikeforums.net")

Any recommendations on what I am doing wrong?

UPDATE

Hello guys,

I started this bounty because I think the R community needs a function like this that can crawl webpages. To win the bounty, the solution should provide a function that takes two parameters:

WebCrawler(url = "www.bikeforums.net", xpath = "//title")
  • As output, I would like a data frame with two columns: the website link, and a column with the matched content wherever the given xpath expression matches (a minimal sketch of such an interface follows below).

I really appreciate your replies
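
To make the desired interface concrete, here is a rough sketch assuming only the XML package. The name CrawlXpath is just a placeholder; it crawls one level of links rather than recursing, and it assumes the xpath selects nodes whose text can be read with xmlValue:

require("XML")

CrawlXpath <- function(url, xpath) {
  doc <- XML::htmlParse(url)
  links <- XML::xpathSApply(doc, "//a/@href")
  XML::free(doc)
  # keep absolute links as-is, resolve relative ones against the base url
  links <- c(links[grepl("^http", links)],
             paste0(url, links[!grepl("^http", links)]))
  # for every link, record the first match of the xpath
  # (NA if nothing matches or the page cannot be parsed)
  matches <- vapply(links, function(link) {
    page <- tryCatch(XML::htmlParse(link), error = function(e) NULL)
    if (is.null(page)) return(NA_character_)
    hit <- XML::xpathSApply(page, xpath, XML::xmlValue)
    XML::free(page)
    if (length(hit) == 0) NA_character_ else as.character(hit[[1]])
  }, character(1))
  data.frame(url = links, match = matches, stringsAsFactors = FALSE)
}

# e.g. CrawlXpath(url = "http://www.bikeforums.net", xpath = "//title")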

asked Apr 18 '15 by Carol.Kar


1 Answer

Insert the following code after the line links <- XML::xpathSApply(doc, "//a/@href") in your function:

links <- XML::xpathSApply(doc, "//a/@href")
links1 <- links[grepl("http", links)] # as @Floo0 pointed out, this keeps absolute (non-relative) links
links2 <- paste0(url, links[!grepl("http", links)]) # and this resolves relative links against the base url
links <- c(links1, links2)
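
For example, with url = "http://www.bikeforums.net", an absolute link such as "http://example.com/page" is kept unchanged, while a relative link such as "/forums/" becomes "http://www.bikeforums.net/forums/".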

Also remember to pass the url with its scheme, e.g. http://www.bikeforums.net rather than www.bikeforums.net; without the scheme, htmlParse does not recognize the string as a URL and tries to parse it as literal XML content, which is exactly the warning you are seeing.

Also, you are not actually filling your urls.out list: indexing with urls.out[[link]] appends named elements instead of writing into the preallocated positions, so those positions all stay NULL.
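
Putting both fixes together, a sketch of the corrected function might look like this (filling urls.out by position is one way to make the update stick; error handling for unreachable pages is still omitted):

WebCrawler <- function(url, depth = NULL, verbose = TRUE) {
  doc <- XML::htmlParse(url)
  links <- XML::xpathSApply(doc, "//a/@href")
  XML::free(doc)
  if (is.null(links)) {
    if (verbose) print(url)
    return(url)
  }
  # keep absolute links, resolve relative ones against the base url
  links <- c(links[grepl("http", links)],
             paste0(url, links[!grepl("http", links)]))
  urls.out <- vector("list", length = length(links))
  for (i in seq_along(links)) {
    if (!is.null(depth) && length(unlist(urls.out)) >= depth) break
    # index by position so the preallocated list is actually filled
    urls.out[[i]] <- WebCrawler(links[i], depth = depth, verbose = verbose)
  }
  urls.out
}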

answered Nov 12 '22 by dimitris_ps