
Using xpathSApply to scrape XML attributes in R

Tags: r, xml, xpath

I am scraping XML in R using xpathSApply (in the XML package) and having trouble pulling attributes out.

First, a relevant snippet of XML:

 <div class="offer-name">
        <a href="http://www.somesite.com" itemprop="name">Fancy Product</a>
      </div>

I have successfully pulled the 'Fancy Product' (i.e. element?) using:

Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue) 

That took some time (I'm a n00b), but the documentation is good and there are several answered questions here I was able to leverage. I can't figure out how to pull the "http://www.somesite.com" out though (attribute?). I've speculated that it involves changing the 3rd term from 'xmlValue' to 'xmlGetAttr' but I could be totally off.

FYI: (1) there are two more parent <div> elements above the snippet I pasted, and (2) here is the abbreviated, complete-ish code (which I don't think is relevant, but I've included it for the sake of completeness):

library(XML)
library(httr)

content2 = paste(readLines(file.choose()), collapse = "\n") # User will select file.
parsedHTML = htmlParse(content2,asText=TRUE)

Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue) 
Tom asked Aug 14 '14 18:08


2 Answers

The href is an attribute of the <a> element. You can select the appropriate node with //div/a and call the xmlGetAttr function on it with name = 'href':

library(XML)

xData <- '<div class="offer-name">
  <a href="http://www.somesite.com" itemprop="name">Fancy Product</a>
  </div>'
parsedHTML <- xmlParse(xData)

# Text content of the div, as in the question
Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue)
# href attribute of every <a> that is a direct child of a <div>
hrefs <- xpathSApply(parsedHTML, "//div/a", xmlGetAttr, 'href')
hrefs
# [1] "http://www.somesite.com"
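
On a real page there are usually several offers, and some links may have no href. Here is a minimal sketch of one way to handle that, assuming a made-up two-offer snippet (the second offer and the `offers` data frame are hypothetical, not from the question). It narrows the XPath to the offer-name divs and passes `default = NA_character_` to xmlGetAttr so the names and links stay aligned:

```r
library(XML)

# Hypothetical input: two offers, the second <a> has no href attribute.
html <- '<div class="offer-name"><a href="http://www.somesite.com" itemprop="name">Fancy Product</a></div>
<div class="offer-name"><a itemprop="name">Plain Product</a></div>'

# htmlParse with asText = TRUE, as in the question's own code
doc <- htmlParse(html, asText = TRUE)

# Restrict the match to links inside the offer-name divs
path <- "//div[@class='offer-name']/a"

offers <- data.frame(
  name = xpathSApply(doc, path, xmlValue),
  # default = NA_character_ keeps one entry per node even when href is missing
  href = xpathSApply(doc, path, xmlGetAttr, "href", default = NA_character_),
  stringsAsFactors = FALSE
)
offers
```

Without the default, xmlGetAttr returns NULL for a node that lacks the attribute, so the result may not simplify to a vector with one entry per product.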
jdharrison answered Nov 01 '22 19:11


You can also do this directly in XPath by selecting the attribute itself with @href, without using xpathSApply(...).

xData <- '<div class="offer-name">
  <a href="http://www.somesite.com" itemprop="name">Fancy Product</a>
  </div>'
library(XML)
parsedHTML <- xmlParse(xData)
hrefs <- unlist(parsedHTML["//div[@class='offer-name']/a/@href"])
hrefs
#                      href 
# "http://www.somesite.com" 
jlhoward answered Nov 01 '22 18:11