I am trying to use the xpathApply function from the XML package in R to extract certain data from an HTML file. However, after I use xpathApply on some parent nodes of the HTML document, the resulting object has class XMLNodeSet, and I cannot call xpathApply on it again; I get this error message: “Error in UseMethod("xpathApply") : no applicable method for 'xpathApply' applied to an object of class "XMLNodeSet"”
Here is an R script that reproduces my problem (this example is just a simple table; I know I could use the readHTMLTable function, but I need lower-level functions because my actual HTML is more complicated than this simple table):
library(XML)
y <- htmlParse(htmlfile)
x <- xpathApply(y, "//table/tr")
z <- xpathApply(x, "/td")
Here is the “htmlfile”:
<table>
<tr>
<td> Test1.1 </td> <td> Test1.2 </td>
</tr>
<tr>
<td> Test1.3 </td> <td> Test1.4 </td>
</tr>
</table>
Is there any way to keep working on the nodes after using xpathApply? Or are there other good alternatives for working with the data in the nodes?
Once you have a list of nodes, you can apply a function such as xmlValue or xmlGetAttr to it to extract data from each node.
For example:
x <- xpathApply(y, "//table/tr")
sapply(x, xmlValue)  ## x is a list of nodes
" Test1.1 Test1.2 " " Test1.3 Test1.4 "
This is equivalent to:
xpathSApply(y,"//table/tr",xmlValue)
" Test1.1 Test1.2 " " Test1.3 Test1.4 "
EDIT
I am sure that your question can be solved with the right XPath expression. You should learn to work with XML files the way you work with a database: XPath is analogous to an SQL query. It is fast, and many browsers can help you generate the right XPath.
For example:
xpathSApply(y,"//table/tr[2]/td[1]",xmlValue) # second tr and first td
[1] " Test1.3 "
xpathSApply(y,"//table/tr[2]/td[3]",xmlValue) # second tr and third td
EDIT
It looks like the OP wants to replicate the XML structure (get the tr and td elements in the same order).
Here is one way, though I don't think it is the most efficient:
nn.trs <- length(xpathSApply(y, "//table/tr", I))
lapply(seq(nn.trs), function(i) {
  xpathSApply(y, paste("//table/tr[", i, "]/td", sep = ''), xmlValue)
})
[[1]]
[1] " Test1.1 " " Test1.2 "
[[2]]
[1] " Test1.3 " " Test1.4 "
If the number of td elements is the same in every tr, you can replace lapply with sapply, and you get:
[,1] [,2]
[1,] " Test1.1 " " Test1.3 "
[2,] " Test1.2 " " Test1.4 "
But I think that in this case readHTMLTable is better.
Although defining the right XPath seems to be the better solution, you can also do this:
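For a plain table like the one in the question, a minimal sketch of the readHTMLTable route might look like this. The html string simply inlines the question's table so the snippet is self-contained, and header = FALSE is an assumption made here because the sample table has no header row.

```r
library(XML)

# The question's table, inlined as text (assumption: no header row)
html <- "<table>
<tr><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"

y <- htmlParse(html, asText = TRUE)

# which = 1 returns the first table directly as a data frame
tab <- readHTMLTable(y, header = FALSE, which = 1, stringsAsFactors = FALSE)
print(tab)
```

This only helps when the data really is a regular table; for irregular markup the XPath approaches are more flexible.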
library(XML)
y <- htmlParse(htmlfile)
x <- getNodeSet(y, "//table/tr")
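z <- lapply(x, function(node) {
  subDoc <- xmlDoc(node)  # copy this row into its own small document
  # query the copy, not the original node, and extract the text before freeing
  r <- xpathApply(subDoc, "//td", xmlValue)
  free(subDoc) # not sure if necessary
  return(r)
})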
The following seems to work. Essentially, you have to search the individual elements of the list returned by xpathApply:
> y <- htmlParse(htmlfile)
> x <- xpathApply(y, "//table/tr")
> x
[[1]]
<tr><td> Test1.1 </td> <td> Test1.2 </td>
</tr>
[[2]]
<tr><td> Test1.3 </td> <td> Test1.4 </td>
</tr>
attr(,"class")
[1] "XMLNodeSet"
> z <- xpathApply(x[[1]], "//td")
> z
[[1]]
<td> Test1.1 </td>
[[2]]
<td> Test1.2 </td>
[[3]]
<td> Test1.3 </td>
[[4]]
<td> Test1.4 </td>
attr(,"class")
[1] "XMLNodeSet"
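PS: Note that this returns the td cells of every row, not just those under x[[1]]. At first this looked like a bug to me, but an XPath that starts with // searches from the document root rather than from the node you pass in.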
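The difference between an absolute and a relative path can be sketched as follows. This is only an illustration: the html string inlines the question's table, and the names all_cells and row_cells are made up here. A path beginning with ./ is evaluated relative to the node passed in, so each row yields only its own cells.

```r
library(XML)

# The question's table, inlined as text so the snippet runs on its own
html <- "<table>
<tr><td> Test1.1 </td> <td> Test1.2 </td></tr>
<tr><td> Test1.3 </td> <td> Test1.4 </td></tr>
</table>"

y <- htmlParse(html, asText = TRUE)
x <- getNodeSet(y, "//table/tr")

# "//td" is absolute: it starts at the document root, so all cells come back
all_cells <- xpathSApply(x[[1]], "//td", xmlValue)

# "./td" is relative: it starts at the given tr, so only that row's cells come back
row_cells <- lapply(x, function(tr) xpathSApply(tr, "./td", xmlValue))
print(row_cells)
```

Because x is an ordinary list of nodes, lapply over it replaces the failing call xpathApply(x, "/td") from the question.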