r - xpathApply on XMLNodeSet (with XML package)

Question

I am trying to use the xpathApply function from the XML package in R to extract data from an HTML file. However, after I call xpathApply on some parent nodes of the document, the result has class XMLNodeSet, and I cannot call xpathApply on that object again. I get this error message: “Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "XMLNodeSet"”

Here is an R script that reproduces the problem (the example is just a simple table; I know I could use the readHTMLTable function, but I need lower-level functions because my actual HTML is more complicated than this):

library(XML)
y <- htmlParse(htmlfile)
x <- xpathApply(y, "//table/tr")
z <- xpathApply(x, "/td")

Here is the “htmlfile”:

<table>
<tr>
<td> Test1.1 </td> <td> Test1.2 </td>
</tr>
<tr>
<td> Test1.3 </td> <td> Test1.4 </td>
</tr>
</table>

Is there any way to keep working on the nodes after using xpathApply? Or are there good alternatives for working with the data in the nodes?

agstudy · Accepted Answer

Once you have a list of nodes, you can apply a function such as xmlValue or xmlGetAttr to each node to extract what you need. For example:

x <- xpathApply(y, "//table/tr")
sapply(x, xmlValue)          ## x is a list of nodes
 " Test1.1  Test1.2 " " Test1.3  Test1.4 "

Which is equivalent to:

xpathSApply(y,"//table/tr",xmlValue)
" Test1.1  Test1.2 " " Test1.3  Test1.4 "
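To make the equivalence above concrete, here is a minimal, self-contained sketch. It assumes the question's HTML is held in a string (the variable name html is hypothetical) and parsed with asText = TRUE. The class attributes on the tr elements are not in the original file; they are added only to illustrate xmlGetAttr.

```r
library(XML)

# Assumption: inline copy of the question's table, with class attributes
# added to the rows purely so that xmlGetAttr has something to return.
html <- "<table>
<tr class='odd'><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr class='even'><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"

y <- htmlParse(html, asText = TRUE)

# Two steps: fetch the row nodes, then apply a function to each node.
rows <- getNodeSet(y, "//table/tr")
via_list <- sapply(rows, xmlValue)

# One step: let xpathSApply apply the function while it collects the nodes.
via_xpath <- xpathSApply(y, "//table/tr", xmlValue)

# Both routes should give the same text for each row.
print(all(via_list == via_xpath))

# Attributes are read the same way, one node at a time.
row_classes <- sapply(rows, xmlGetAttr, "class")
print(row_classes)
```

The point is that the node set behaves like an ordinary R list, so sapply and lapply work on it even though xpathApply itself has no method for the XMLNodeSet class.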

EDIT

I am sure your question can be solved with the right XPath expression. You should learn to work with XML files the way you work with a database: XPath is analogous to an SQL query. It is fast, and many browsers can help you generate the right XPath.

For example :

 xpathSApply(y,"//table/tr[2]/td[1]",xmlValue) #  second tr and first td
 [1] " Test1.3 "
 xpathSApply(y,"//table/tr[2]/td[3]",xmlValue) #  second tr and third td (no such cell in this example)
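As a small sketch of how positional predicates behave, assuming the same table held in a hypothetical string html: a path that matches nothing returns an empty result rather than an error, and last() is the standard XPath 1.0 way to pick the final element without knowing the count.

```r
library(XML)

# Assumption: inline copy of the question's table (variable name is illustrative).
html <- "<table>
<tr><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"
y <- htmlParse(html, asText = TRUE)

# Second row, first cell.
first_of_second <- xpathSApply(y, "//table/tr[2]/td[1]", xmlValue)
print(first_of_second)

# Second row, third cell: the sample has only two cells per row,
# so nothing matches and the result has length zero.
missing_cell <- xpathSApply(y, "//table/tr[2]/td[3]", xmlValue)
print(length(missing_cell))

# last() selects the final row and the final cell in that row.
last_cell <- xpathSApply(y, "//table/tr[last()]/td[last()]", xmlValue)
print(trimws(last_cell))
```

Checking length() on the result before using it is a cheap guard when rows may have different numbers of cells.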

EDIT

It looks like the OP wants to replicate the XML structure (get the tr and td elements in the same order).

Here is one way, although I don't think it is the most efficient:

nn.trs <- length(xpathSApply(y,"//table/tr",I))
lapply(seq(nn.trs),function(i){
       xpathSApply(y,paste("//table/tr[",i,"]/td",sep=''),xmlValue)
})
[[1]]
[1] " Test1.1 " " Test1.2 "

[[2]]
[1] " Test1.3 " " Test1.4 "

If the number of td elements is the same in every tr, you can replace lapply with sapply and you get:

    [,1]        [,2]       
[1,] " Test1.1 " " Test1.3 "
[2,] " Test1.2 " " Test1.4 "

But I think that in this case readHTMLTable is better.
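For a regular table like this one, a readHTMLTable call might look like the sketch below. It assumes the question's HTML is held in a hypothetical string html. header = FALSE is passed because the sample has no header row, and stringsAsFactors = FALSE is assumed to be forwarded to the data frame conversion.

```r
library(XML)

# Assumption: inline copy of the question's table (variable name is illustrative).
html <- "<table>
<tr><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"
y <- htmlParse(html, asText = TRUE)

# readHTMLTable on a parsed document returns a list with one entry per <table>.
tables <- readHTMLTable(y, header = FALSE, stringsAsFactors = FALSE)

# Take the first (and here only) table as a data frame: one row per <tr>.
tbl <- tables[[1]]
print(dim(tbl))
print(tbl)
```

This keeps the row and column layout without any explicit XPath, but it only suits markup that really is a plain table, which is why the lower-level functions are still needed for more irregular pages.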

c0bra · Answer

Although defining the right XPath seems to be the better solution, you can also do this:

library(XML)
y <- htmlParse(htmlfile)
x <- getNodeSet(y, "//table/tr")
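z <- lapply(x, function(node){
                 subDoc <- xmlDoc(node)                       # copy this <tr> into its own document
                 r <- xpathSApply(subDoc, "/tr/td", xmlValue) # query the sub-document (its root is the <tr>) and extract the text before releasing it
                 free(subDoc) # not sure if necessary
                 return(r)
})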

CHP · Answer

The following seems to work. Essentially, you have to search the elements of the list returned by xpathApply:

> y <- htmlParse(htmlfile)
> x <- xpathApply(y, "//table/tr")
> x
[[1]]
<tr><td> Test1.1 </td> <td> Test1.2 </td>
</tr> 

[[2]]
<tr><td> Test1.3 </td> <td> Test1.4 </td>
</tr> 

attr(,"class")
[1] "XMLNodeSet"
> z <- xpathApply(x[[1]], "//td")
> z
[[1]]
<td> Test1.1 </td> 

[[2]]
<td> Test1.2 </td> 

[[3]]
<td> Test1.3 </td> 

[[4]]
<td> Test1.4 </td> 

attr(,"class")
[1] "XMLNodeSet"

PS: I am not sure why it searches all elements of x list rather than just x[[1]]. Seems like a bug.
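A likely explanation is standard XPath behaviour rather than a bug: a path that starts with / or // is evaluated from the document root, whichever node it is applied to. A path that starts with ./ is evaluated relative to that node. The sketch below assumes the question's HTML is held in a hypothetical string html and compares the two forms.

```r
library(XML)

# Assumption: inline copy of the question's table (variable name is illustrative).
html <- "<table>
<tr><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"
y <- htmlParse(html, asText = TRUE)
rows <- getNodeSet(y, "//table/tr")

# Absolute path: searched from the document root, so every <td> is returned.
all_cells <- xpathApply(rows[[1]], "//td")
print(length(all_cells))

# Relative path: only the <td> children of the first <tr>.
own_cells <- xpathApply(rows[[1]], "./td")
print(length(own_cells))

# Applying the relative path to each row keeps the tr/td nesting.
cells_by_row <- lapply(rows, function(tr) xpathSApply(tr, "./td", xmlValue))
print(cells_by_row)
```

Looping over the node set with lapply and a relative path is one way to do what the question asks: keep working on the nodes returned by the first query, one node at a time.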

Tags: html, r, web-scraping

Joyce
