I have a very complicated XML file that I need to parse and present as a data frame in R. Its structure is similar to the following example. The nodes are not all at the same level: some values are nested one level deeper than others.
<Root>
<A>
<info1>a</info1>
<child>
<info2>b</info2>
<info3>c</info3>
<info4>d</info4>
</child>
<info5>e</info5>
</A>
<B>
<info6>f</info6>
<info7>g</info7>
</B>
</Root>
I came up with some code to parse the file:
library(XML)
doc <- xmlParse(file = "sample.xml", useInternalNodes = TRUE)
rootnode <- xmlRoot(doc)
df1 <- xmlToDataFrame(nodes = getNodeSet(rootnode, "//Root/A"))
df2 <- xmlToDataFrame(nodes = getNodeSet(rootnode, "//Root/B"))
# all = TRUE belongs to merge(), not cbind(); passed here it would only add a column named "all"
Final <- cbind.data.frame(df1, df2)
The result returned is as follows (all the values from the child node are squashed together):
info1 child info5 info6 info7
a bcd e f g
However, the ideal result I want is:
info1 info2 info3 info4 info5 info6 info7
a b c d e f g
Because the XML file contains a large number of nodes like the ones above, manipulating the data frame by hand is not practical.
I also tried changing the path to "//Root/A/child", but then all the values directly under node A and node B are missing.
Could anyone offer a solution to this problem? Thanks in advance.
One can try xmlToList and unlist to flatten the XML data into a named vector. The names can then be changed with gsub to match the OP's expectations:
library(XML)
result <- unlist(xmlToList(xmlParse(xml)))
# Keep only the last component of each name (the leaf element name), e.g. "A.child.info2" becomes "info2"
names(result) <- gsub(".*\\.(\\w+)$","\\1", names(result))
result
# info1 info2 info3 info4 info5 info6 info7
# "a" "b" "c" "d" "e" "f" "g"
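Since the question asks for a data frame rather than a named vector, the vector can be converted in one more step. This is a minimal sketch: `result` is hard-coded here as a stand-in for the vector shown above, and `final` is just an illustrative name.

```r
# Stand-in for the named character vector produced above
result <- c(info1 = "a", info2 = "b", info3 = "c", info4 = "d",
            info5 = "e", info6 = "f", info7 = "g")

# as.list() keeps the names, so each element becomes its own column
final <- as.data.frame(as.list(result), stringsAsFactors = FALSE)
final
#   info1 info2 info3 info4 info5 info6 info7
# 1     a     b     c     d     e     f     g
```

Note that this assumes the leaf names are unique. If two branches contain a leaf with the same name, the shortened names would collide, so keeping the full dotted names may be safer in that case.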
Data:
xml <- "<Root>
<A>
<info1>a</info1>
<child>
<info2>b</info2>
<info3>c</info3>
<info4>d</info4>
</child>
<info5>e</info5>
</A>
<B>
<info6>f</info6>
<info7>g</info7>
</B>
</Root>"
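As a possible alternative that avoids the gsub step, the leaf elements can be selected directly with XPath. The sketch below is a different technique from the one above, not a replacement for it; `sample_xml` is a shortened, hypothetical copy of the sample data so the snippet runs on its own.

```r
library(XML)

# Shortened copy of the sample data (hypothetical name, for illustration only)
sample_xml <- "<Root>
<A><info1>a</info1><child><info2>b</info2><info3>c</info3></child></A>
<B><info6>f</info6></B>
</Root>"

doc <- xmlParse(sample_xml, asText = TRUE)

# "//*[not(*)]" matches every element that has no child elements, i.e. the leaves
leaves <- getNodeSet(doc, "//*[not(*)]")

# Use each leaf's element name as the name and its text content as the value
values <- setNames(sapply(leaves, xmlValue), sapply(leaves, xmlName))
values
```

The same caveat applies here: leaves that share a name in different branches end up with duplicate names in the vector.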