
r - xpathApply on XMLNodeSet (with XML package)

I am trying to use the xpathApply function from the XML package in R to extract data from an HTML file. However, after I call xpathApply on some parent nodes of the document, the result has class XMLNodeSet, and I cannot call xpathApply on that object again. I get this error: “Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "XMLNodeSet"”

Here is an R script that reproduces the problem. (This example is just a simple table. I know I could use the readHTMLTable function, but I need lower-level functions because my actual HTML is more complicated than this.)

library(XML)
y <- htmlParse(htmlfile)
x <- xpathApply(y, "//table/tr")
z <- xpathApply(x, "/td")

Here is the “htmlfile”:

<table>
<tr>
<td> Test1.1 </td> <td> Test1.2 </td>
</tr>
<tr>
<td> Test1.3 </td> <td> Test1.4 </td>
</tr>
</table>

Is there a way to keep working on the nodes after using xpathApply? Or are there good alternatives for working with the data in the nodes?

Joyce asked Feb 19 '13 12:02


3 Answers

Once you have a list of nodes, you can apply a function such as xmlValue or xmlGetAttr to each one to extract its content. For example:

x <- xpathApply(y, "//table/tr")
sapply(x,xmlValue)          ## x is a list of nodes
 " Test1.1  Test1.2 " " Test1.3  Test1.4 "

This is equivalent to:

xpathSApply(y,"//table/tr",xmlValue)
" Test1.1  Test1.2 " " Test1.3  Test1.4 "
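The answer also mentions xmlGetAttr, which the sample table cannot exercise because its cells carry no attributes. A minimal sketch, assuming a hypothetical variant of the table in which each td has an `id` attribute (the `id` values and the inlined `html` string are made up for illustration):

```r
library(XML)

# Hypothetical markup: the question's table has no attributes, so an `id`
# is added to each cell here purely for illustration.
html <- "<table>
<tr><td id='a'> Test1.1 </td> <td id='b'> Test1.2 </td></tr>
</table>"

y <- htmlParse(html, asText = TRUE)

# Arguments given after the function are passed on to it, so "id" becomes
# the attribute name that xmlGetAttr() looks up on every matched node.
ids <- xpathSApply(y, "//table/tr/td", xmlGetAttr, "id")
print(ids)
```

The same pattern should work for any attribute name, for example "href" on a elements.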

EDIT

I am sure that your question can be solved with the right XPath expression. You should learn to work with XML files the way you work with a database: XPath is analogous to an SQL query. It is fast, and many browsers can help you generate the right XPath.

For example:

 xpathSApply(y,"//table/tr[2]/td[1]",xmlValue) #  second tr and first td
 [1] " Test1.3 "
 xpathSApply(y,"//table/tr[2]/td[3]",xmlValue) #  second tr and third td (the sample rows have only two cells, so nothing matches)

EDIT

It looks like the OP wants to replicate the XML structure (get the tr and td elements in the same order).

Here is one way to do it, though I don't think it is the most efficient:

nn.trs <- length(xpathSApply(y,"//table/tr",I))
lapply(seq(nn.trs),function(i){
       xpathSApply(y,paste("//table/tr[",i,"]/td",sep=''),xmlValue)
})
[[1]]
[1] " Test1.1 " " Test1.2 "

[[2]]
[1] " Test1.3 " " Test1.4 "

If the number of td elements is the same in every tr, you can replace lapply with sapply and get:

    [,1]        [,2]       
[1,] " Test1.1 " " Test1.3 "
[2,] " Test1.2 " " Test1.4 "

But I think that in this case readHTMLTable is better.
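For a plain table like the sample, the readHTMLTable route might look roughly like this. It is a sketch only: the table is inlined as a string for self-containment, and `header = FALSE` is an assumption based on the sample having no header row.

```r
library(XML)

# Sample table from the question, inlined as a string (an assumption; the
# question reads it from `htmlfile`).
html <- "<table>
<tr><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"

y <- htmlParse(html, asText = TRUE)

# readHTMLTable() on a parsed document returns a list with one data frame
# per <table>; header = FALSE because the sample has no header row.
tabs <- readHTMLTable(y, header = FALSE, stringsAsFactors = FALSE)
df <- tabs[[1]]
print(df)
```

This only helps when the markup really is a regular grid; for irregular HTML the XPath-based approaches are likely still needed.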

agstudy answered Nov 04 '22 06:11


Although writing the right XPath seems to be the better solution, you can also do this:

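library(XML)
y <- htmlParse(htmlfile)
x <- getNodeSet(y, "//table/tr")
z <- lapply(x, function(tr){
                 subDoc <- xmlDoc(tr)                        # copy this tr into its own document
                 r <- xpathApply(subDoc, "/tr/td", xmlValue) # query the copy, where tr is the root
                 free(subDoc)                                # release the copy once the text is extracted
                 return(r)
})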
c0bra answered Nov 04 '22 06:11


The following seems to work. Essentially, you have to search the individual elements of the list returned by xpathApply:

> y <- htmlParse(htmlfile)
> x <- xpathApply(y, "//table/tr")
> x
[[1]]
<tr><td> Test1.1 </td> <td> Test1.2 </td>
</tr> 

[[2]]
<tr><td> Test1.3 </td> <td> Test1.4 </td>
</tr> 

attr(,"class")
[1] "XMLNodeSet"
> z <- xpathApply(x[[1]], "//td")
> z
[[1]]
<td> Test1.1 </td> 

[[2]]
<td> Test1.2 </td> 

[[3]]
<td> Test1.3 </td> 

[[4]]
<td> Test1.4 </td> 

attr(,"class")
[1] "XMLNodeSet"

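PS: Note that this returns the td elements of every row rather than just those under x[[1]]. That is not a bug: an XPath starting with // is evaluated from the document root even when a node is supplied as the context. A relative path such as "./td" or ".//td" restricts the search to x[[1]].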
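For comparison, here is a minimal sketch of the same query written with a relative path, applied to every row of the node set. The inlined `html` string and the variable names `all_cells` and `row_cells` are assumptions added for self-containment, and this has not been checked against the asker's real document.

```r
library(XML)

# Sample table from the question, inlined as a string (an assumption; the
# question reads it from `htmlfile`).
html <- "<table>
<tr><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"

y <- htmlParse(html, asText = TRUE)
x <- getNodeSet(y, "//table/tr")

# "//td" is an absolute path: it starts at the document root even though a
# single tr node is passed in, so cells from every row are matched.
all_cells <- xpathSApply(x[[1]], "//td", xmlValue)

# "./td" is relative: it only matches td children of the node passed in,
# so each row yields just its own cells.
row_cells <- lapply(x, function(tr) xpathSApply(tr, "./td", xmlValue))
print(row_cells)
```

Looping over the node set with lapply also sidesteps the original error, since xpathSApply is then called on individual nodes rather than on the XMLNodeSet itself.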

CHP answered Nov 04 '22 06:11