I took the following code from the rNomads package and modified it a little. When I run it, I get:
> WebCrawler(url = "www.bikeforums.net")
[1] "www.bikeforums.net"
[1] "www.bikeforums.net"
Warning message:
XML content does not seem to be XML: 'www.bikeforums.net'
Here is the code:
require("XML")
# cleaning workspace
rm(list = ls())
# This function recursively searches for links in the given url and follows every single link.
# It returns a list of the final (dead end) URLs.
# depth - How many links to return. This avoids having to recursively scan hundreds of links. Defaults to NULL, which returns everything.
WebCrawler <- function(url, depth = NULL, verbose = TRUE) {
doc <- XML::htmlParse(url)
links <- XML::xpathSApply(doc, "//a/@href")
XML::free(doc)
if(is.null(links)) {
if(verbose) {
print(url)
}
return(url)
} else {
urls.out <- vector("list", length = length(links))
for(link in links) {
if(!is.null(depth)) {
if(length(unlist(urls.out)) >= depth) {
break
}
}
urls.out[[link]] <- WebCrawler(link, depth = depth, verbose = verbose)
}
return(urls.out)
}
}
# Execution
WebCrawler(url = "www.bikeforums.net")
Any recommendations on what I am doing wrong?
UPDATE
Hello guys,
I started this bounty because I think the R community needs a function like this that can crawl webpages. The solution that wins the bounty should show a function that takes two parameters:
WebCrawler(url = "www.bikeforums.net", xpath = "\\title" )
I really appreciate your replies.
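To make it concrete, here is a rough sketch of the kind of function I have in mind, built on the same XML package as above (the CrawlXPath name and the max.pages cap are just placeholders, not a finished solution):
# Sketch: visit pages reachable from `url` and collect the text of nodes
# matching `xpath` on every visited page. `max.pages` caps the crawl so it terminates.
CrawlXPath <- function(url, xpath, max.pages = 20, verbose = TRUE) {
  visited <- character(0)
  results <- list()
  queue <- url
  while (length(queue) > 0 && length(visited) < max.pages) {
    current <- queue[1]
    queue <- queue[-1]
    if (current %in% visited) next
    visited <- c(visited, current)
    doc <- try(XML::htmlParse(current), silent = TRUE)
    if (inherits(doc, "try-error")) next
    # Collect whatever the requested xpath matches on this page
    results[[current]] <- XML::xpathSApply(doc, xpath, XML::xmlValue)
    # Queue absolute links found on this page for later visits
    links <- XML::xpathSApply(doc, "//a/@href")
    XML::free(doc)
    if (!is.null(links)) {
      queue <- c(queue, links[grepl("^http", links)])
    }
    if (verbose) print(current)
  }
  return(results)
}
# e.g. CrawlXPath(url = "http://www.bikeforums.net", xpath = "//title")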
Insert the following code under links <- XML::xpathSApply(doc, "//a/@href") in your function:
links <- XML::xpathSApply(doc, "//a/@href")
links1 <- links[grepl("http", links)] # as @Floo0 pointed out, this captures the non-relative (absolute) links
links2 <- paste0(url, links[!grepl("http", links)]) # and this rebuilds the relative links
links <- c(links1, links2)
Also remember to pass the url with the scheme included, i.e. as http://www......
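For example, the call from the question would then look roughly like this:
# Include the scheme so htmlParse receives a proper URL
WebCrawler(url = "http://www.bikeforums.net")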
You are also not updating your urls.out list. As you have it, it will always be an empty list with the same length as links.
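Putting both fixes together, the function might look roughly like this (a sketch along those lines, not tested code):
require("XML")

WebCrawler <- function(url, depth = NULL, verbose = TRUE) {
  doc <- XML::htmlParse(url)
  links <- XML::xpathSApply(doc, "//a/@href")
  XML::free(doc)
  if (is.null(links)) {
    if (verbose) print(url)
    return(url)
  }
  # Keep absolute links as-is and rebuild relative ones against the start url
  links1 <- links[grepl("http", links)]
  links2 <- paste0(url, links[!grepl("http", links)])
  links <- c(links1, links2)
  urls.out <- vector("list", length = length(links))
  for (i in seq_along(links)) {
    if (!is.null(depth) && length(unlist(urls.out)) >= depth) {
      break
    }
    # Index by position so the results are actually stored in urls.out
    urls.out[[i]] <- WebCrawler(links[i], depth = depth, verbose = verbose)
  }
  return(urls.out)
}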