I am trying to write code that will go to each painting's page and take information from there. The starting URL is http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet
I have code that should output all of the hrefs, but it doesn't work.
library(XML)
library(RCurl)
library(stringr)
tagrecode <- readHTMLTable("http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet")
tabla <- as.data.frame(tagrecode)
str(tabla)
names(tabla) <- c("name", "desc", "cat", "updated")
str(tabla)
res <- htmlParse("http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet")
enlaces <- getNodeSet (res, "//p[@class='pb5']/a/@href")
enlaces <- unlist(lapply(enlaces, as.character))
tabla$enlace <- paste("http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet")
str(tabla)
lisurl <- tabla$enlace
fu1 <- function(url){
  print(url)
  pas1 <- htmlParse(url, useInternalNodes=T)
  pas2 <- xpathSApply(pas1, "//p[@class='pb5']/a/@href")
}
urldef <- lapply(lisurl,fu1)
Once I have the list of URLs of all the pictures on this page, I want to go to the second, third, ..., twenty-third pages to collect the URLs of all the pictures there as well.
The next step is to scrape the information about every picture. I have working code for a single picture (shown below, followed by a sketch of the overall structure I'm aiming for), and I need to build it into one general script.
library(XML)
url = "http://www.wikiart.org/en/claude-monet/camille-and-jean-monet-in-the-garden-at-argenteuil"
doc = htmlTreeParse(url, useInternalNodes=T)
pictureName <- xpathSApply(doc,"//h1[@itemprop='name']", xmlValue)
date <- xpathSApply(doc, "//span[@itemprop='dateCreated']", xmlValue)
author <- xpathSApply(doc, "//a[@itemprop='author']", xmlValue)
style <- xpathSApply(doc, "//span[@itemprop='style']", xmlValue)
genre <- xpathSApply(doc, "//span[@itemprop='genre']", xmlValue)
pictureName
date
author
style
genre
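Roughly, the overall structure I'm aiming for looks something like this (just a sketch; I'm assuming the listing pages are reached by appending the page number 1-23 to the base URL, and it doesn't yet handle pictures with missing fields):

library(XML)

base <- "http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet/"
page.urls <- paste0(base, 1:23)                   # assumed pagination scheme

get.hrefs <- function(u) {
  doc <- htmlParse(u, useInternalNodes = TRUE)
  xpathSApply(doc, "//p[@class='pb5']/a/@href")   # relative links to the painting pages
}

hrefs <- unlist(lapply(page.urls, get.hrefs))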
Any advice on how to do this would be appreciated!
This seems to work.
library(XML)
library(httr)
url <- "http://www.wikiart.org/en/claude-monet/mode/all-paintings-by-alphabet/"
hrefs <- list()
# collect the painting links from each of the 23 listing pages
for (i in 1:23) {
  response <- GET(paste0(url, i))
  doc <- content(response, type = "text/html")
  hrefs <- c(hrefs, doc["//p[@class='pb5']/a/@href"])
}
url <- "http://www.wikiart.org"
xPath <- c(pictureName = "//h1[@itemprop='name']",
           date        = "//span[@itemprop='dateCreated']",
           author      = "//a[@itemprop='author']",
           style       = "//span[@itemprop='style']",
           genre       = "//span[@itemprop='genre']")
# scrape one picture page; any field whose tag is missing comes back as NA
get.picture <- function(href) {
  response <- GET(paste0(url, href))
  doc <- content(response, type = "text/html")
  info <- sapply(xPath, function(xp) ifelse(length(doc[xp]) == 0, NA, xmlValue(doc[xp][[1]])))
}
pictures <- do.call(rbind,lapply(hrefs,get.picture))
head(pictures)
# pictureName date author style genre
# [1,] "A Corner of the Garden at Montgeron" "1877" "Claude Monet" "Impressionism" "landscape"
# [2,] "A Corner of the Studio" "1861" "Claude Monet" "Realism" "self-portrait"
# [3,] "A Farmyard in Normandy" "c.1863" "Claude Monet" "Realism" "landscape"
# [4,] "A Windmill near Zaandam" NA "Claude Monet" "Impressionism" "landscape"
# [5,] "A Woman Reading" "1872" "Claude Monet" "Impressionism" "genre painting"
# [6,] "Adolphe Monet Reading in the Garden" "1866" "Claude Monet" "Impressionism" "genre painting"
You were actually pretty close. Your XPath expressions are fine; one problem is that not all of the pictures have all of the information (i.e., for some of the pages the node sets you are trying to access are empty) - note the missing date for "A Windmill near Zaandam". So the code has to deal with this possibility.
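If you want to see that guard in isolation, the sketch below shows the same pattern get.picture(...) relies on, run against the fourth href collected above (which, in the output shown here, is "A Windmill near Zaandam" and has no date - the index is only valid for that ordering):

safe.value <- function(doc, xp) {
  nodes <- doc[xp]                      # empty node set when the tag is missing
  if (length(nodes) == 0) NA else xmlValue(nodes[[1]])
}

doc <- content(GET(paste0(url, hrefs[[4]])), type = "text/html")
safe.value(doc, "//span[@itemprop='dateCreated']")   # NA
safe.value(doc, "//h1[@itemprop='name']")            # "A Windmill near Zaandam"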
So in this example, the first loop grabs the values of the href attributes of the anchor tags on each of the 23 listing pages and combines them into a list of roughly 1300 links.
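For instance, a quick sanity check after the loop (the exact count depends on what is on the site when you run it):

length(hrefs)                       # number of painting links collected, roughly 1300
head(sapply(hrefs, as.character))   # relative links of the form "/en/claude-monet/<painting-slug>"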
To process each of these ~1300 pages, and since we have to deal with missing tags, it's more straightforward to create a vector containing the XPath strings and apply it element-wise to each page. That's what the function get.picture(...) does. The last statement calls this function for each of the hrefs and binds the results together row-wise using do.call(rbind, ...).
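Since rbind() on character vectors gives a character matrix, you may want to finish by turning the result into a data frame; a small, optional follow-up (the file name is just an example):

pictures <- as.data.frame(pictures, stringsAsFactors = FALSE)
str(pictures)
write.csv(pictures, "monet-paintings.csv", row.names = FALSE)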
Note also that this code uses the somewhat more compact indexing feature for objects of class HTMLInternalDocument: doc[xpath], where xpath is an XPath string. This avoids the use of xpathSApply(...), although the latter would have worked.
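For comparison, here is the same lookup written both ways, assuming doc is one of the parsed picture pages from above:

xmlValue(doc["//h1[@itemprop='name']"][[1]])          # compact indexing on the document
xpathSApply(doc, "//h1[@itemprop='name']", xmlValue)  # equivalent xpathSApply call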