How to write code for web crawling and scraping in R

I am trying to write code that will go to each page and collect the information there. The starting URL is http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet

I have code to output all hrefs, but it doesn't work.

library(XML)
library(RCurl)
library(stringr)
tagrecode <- readHTMLTable("http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet")
tabla <- as.data.frame(tagrecode)
str(tabla)
names(tabla) <- c("name", "desc", "cat", "updated")
str(tabla)
res <- htmlParse("http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet")
enlaces <- getNodeSet(res, "//p[@class='pb5']/a/@href")
enlaces <- unlist(lapply(enlaces, as.character))
tabla$enlace <- paste("http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet")
str(tabla)
lisurl <- tabla$enlace

fu1 <- function(url){
  print(url)
  pas1 <- htmlParse(url, useInternalNodes=T)
  pas2 <- xpathSApply(pas1, "//p[@class='pb5']/a/@href")
}
urldef <- lapply(lisurl, fu1)

After I have the list of URLs of all the pictures on this page, I want to go to the second, third, ..., 23rd pages to collect the URLs of all the pictures there.
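
A minimal sketch of what I have in mind (assuming the index pages follow the pattern .../all-paintings-by-alphabet/2, /3, and so on, and reusing the same XML/RCurl approach):

library(XML)
library(RCurl)
base <- "http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet/"
all.hrefs <- character()
for (i in 1:23) {
  page <- getURL(paste0(base, i))          # raw HTML of index page i
  doc  <- htmlParse(page, asText = TRUE)
  all.hrefs <- c(all.hrefs,
                 xpathSApply(doc, "//p[@class='pb5']/a/@href"))
}
length(all.hrefs)   # should be one relative link per painting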

The next step is to scrape the info about every picture. I have working code for one picture, and I need to build it into one general script (a sketch of what I mean follows the code below).

library(XML)
url <- "http://www.wikiart.org/en/claude-monet/camille-and-jean-monet-in-the-garden-at-argenteuil"
doc <- htmlTreeParse(url, useInternalNodes=T)
# each field is tagged with a schema.org itemprop attribute
pictureName <- xpathSApply(doc, "//h1[@itemprop='name']", xmlValue)
date <- xpathSApply(doc, "//span[@itemprop='dateCreated']", xmlValue)
author <- xpathSApply(doc, "//a[@itemprop='author']", xmlValue)
style <- xpathSApply(doc, "//span[@itemprop='style']", xmlValue)
genre <- xpathSApply(doc, "//span[@itemprop='genre']", xmlValue)

pictureName
date
author
style
genre
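
Roughly, I want to wrap this in a function and apply it to every URL; a sketch (get.info is just a placeholder name, and it assumes the same XPaths work on every painting page):

get.info <- function(u) {
  doc <- htmlTreeParse(u, useInternalNodes = TRUE)
  grab <- function(xp) {
    val <- xpathSApply(doc, xp, xmlValue)
    if (length(val) == 0) NA else val[1]   # some fields may be missing
  }
  data.frame(pictureName = grab("//h1[@itemprop='name']"),
             date        = grab("//span[@itemprop='dateCreated']"),
             author      = grab("//a[@itemprop='author']"),
             style       = grab("//span[@itemprop='style']"),
             genre       = grab("//span[@itemprop='genre']"),
             stringsAsFactors = FALSE)
}
# urls would be the full painting URLs collected above:
# info <- do.call(rbind, lapply(urls, get.info))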

Any advice on how to do this will be appreciated!

asked Jul 04 '14 by user3793981

1 Answer

This seems to work.

library(XML)
library(httr)
url <- "http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet/"
hrefs <- list()
# collect the relative link for every painting from the 23 index pages
for (i in 1:23) {
  response <- GET(paste0(url,i))
  doc      <- content(response,type="text/html")
  hrefs    <- c(hrefs,doc["//p[@class='pb5']/a/@href"])
}
url      <- "http://www.wikiart.org"
# one XPath per field of interest, applied to each painting page
xPath    <- c(pictureName = "//h1[@itemprop='name']",
              date        = "//span[@itemprop='dateCreated']",
              author      = "//a[@itemprop='author']",
              style       = "//span[@itemprop='style']",
              genre       = "//span[@itemprop='genre']")
get.picture <- function(href) {
  response <- GET(paste0(url,href))
  doc      <- content(response,type="text/html")
  # empty node sets (missing fields) become NA instead of an error
  info     <- sapply(xPath,function(xp)ifelse(length(doc[xp])==0,NA,xmlValue(doc[xp][[1]])))
}
pictures <- do.call(rbind,lapply(hrefs,get.picture))
head(pictures)
#      pictureName                           date     author         style           genre           
# [1,] "A Corner of the Garden at Montgeron" "1877"   "Claude Monet" "Impressionism" "landscape"     
# [2,] "A Corner of the Studio"              "1861"   "Claude Monet" "Realism"       "self-portrait" 
# [3,] "A Farmyard in Normandy"              "c.1863" "Claude Monet" "Realism"       "landscape"     
# [4,] "A Windmill near Zaandam"             NA       "Claude Monet" "Impressionism" "landscape"     
# [5,] "A Woman Reading"                     "1872"   "Claude Monet" "Impressionism" "genre painting"
# [6,] "Adolphe Monet Reading in the Garden" "1866"   "Claude Monet" "Impressionism" "genre painting"

You were actually pretty close. Your xPath is fine; one problem is that not all of the pictures have all of the information (e.g., for some of the pages the nodeSets you are trying to access are empty). Note the missing date for "A Windmill near Zaandam" above. So the code has to deal with this possibility.

So in this example, the first loop grabs the value of the href attribute of the anchor tags on each of the 23 index pages and combines these into a list of roughly 1300 relative links.
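
As a quick sanity check (illustrative; the exact count depends on what the site returns):

length(hrefs)            # roughly 1300 relative paths
head(unlist(hrefs), 2)   # each still needs the "http://www.wikiart.org" prefix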

To process each of these ~1300 pages while dealing with the missing tags, it's more straightforward to create a vector containing the xPath strings and apply it element-wise to each page. That's what the function get.picture(...) does. The last statement calls this function with each of the hrefs and binds the results together row-wise using do.call(rbind,...).
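
Since rbind on character vectors produces a character matrix, an optional follow-up (a sketch, not part of the scraper itself) is to convert the result to a data frame for easier filtering:

pictures.df <- as.data.frame(pictures, stringsAsFactors = FALSE)
table(pictures.df$style)                         # paintings per style
head(subset(pictures.df, genre == "landscape"))  # filter by genre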

Note also that this code uses the somewhat more compact indexing feature for objects of class HTMLInternalDocument: doc[xpath], where xpath is an xPath string. This avoids the use of xpathSApply(...), although the latter would have worked as well.
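
For comparison, the two forms extract the same string (assuming doc still holds a parsed painting page):

xp <- "//span[@itemprop='style']"
xmlValue(doc[xp][[1]])              # compact indexing form used above
xpathSApply(doc, xp, xmlValue)[1]   # classic form, same result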

answered Oct 21 '22 by jlhoward