I'm trying to learn R's XML package. I'm trying to create a data.frame from the books.xml sample XML data file. Here's what I have so far:
    library(XML)
    books <- "http://www.w3schools.com/XQuery/books.xml"
    doc <- xmlTreeParse(books, useInternalNodes = TRUE)
    doc
    xpathApply(doc, "//book", function(x) do.call(paste, as.list(xmlValue(x))))
    xpathSApply(doc, "//book", function(x) strsplit(xmlValue(x), " "))
    xpathSApply(doc, "//book/child::*", xmlValue)
None of these xpathApply/xpathSApply calls gets me even close to what I want. How should one proceed toward a well-formed data.frame?
Ordinarily, I would suggest trying the xmlToDataFrame() function, but I believe that will be fairly tricky here because the data isn't well structured to begin with.
I would recommend working with this function:
xmlToList(books)
One problem is that there are multiple authors per book, so you will need to decide how to handle that when you're structuring your data frame.
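As one possible way to handle the multiple authors, you could collapse them into a single string per book. This is only a sketch: the inline XML is an abbreviated stand-in for books.xml, the `"; "` separator and the column names are my own choices, and it assumes every book has exactly one title, year and price.

```r
library(XML)

# Abbreviated stand-in for books.xml (assumption: same structure as the real file)
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>'

doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# One string per book: all <author> children joined with "; "
authors <- vapply(
  book_nodes,
  function(b) paste(xpathSApply(b, "./author", xmlValue), collapse = "; "),
  character(1)
)

# Assumes exactly one title/year/price per book, so the columns line up
books_df <- data.frame(
  title    = xpathSApply(doc, "//book/title", xmlValue),
  authors  = authors,
  year     = as.integer(xpathSApply(doc, "//book/year", xmlValue)),
  price    = as.numeric(xpathSApply(doc, "//book/price", xmlValue)),
  category = xpathSApply(doc, "//book", xmlGetAttr, "category"),
  stringsAsFactors = FALSE
)
```

The trade-off is that you get one tidy row per book, but you have to split the `authors` column again if you ever need the individual names.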
Once you have decided what to do about the multiple authors, it's fairly straightforward to turn your book list into a data frame with the ldply() function in plyr (or just use lapply and convert the return value into a data.frame by using do.call("rbind"...)).
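For reference, the lapply plus do.call("rbind", ...) route might look roughly like this when the author entries are dropped. Again just a sketch: the inline XML is an abbreviated stand-in for books.xml, and `book_list`, `rows` and `books_df` are names I made up for illustration.

```r
library(XML)

# Abbreviated stand-in for books.xml (assumption: same structure as the real file)
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore>'

book_list <- xmlToList(xmlParse(xml_text, asText = TRUE))

# Drop the author entries so every book yields the same set of columns
rows <- lapply(book_list, function(x) {
  data.frame(x[names(x) != "author"], row.names = NULL)
})

# rbind works here because all one-row data frames share the same columns
books_df <- do.call("rbind", rows)
rownames(books_df) <- NULL
```

This only works because every row has identical columns once author is removed; with the authors left in, the rows have different widths and plain rbind will complain.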
Here's a complete example (excluding author):
    library(XML)
    library(plyr)
    # same sample file as in the question
    books <- "http://www.w3schools.com/XQuery/books.xml"
    ldply(xmlToList(books), function(x) {
      data.frame(x[!names(x) == "author"])
    })

       .id        title.text title..attrs year price   .attrs
    1 book  Everyday Italian           en 2005 30.00  COOKING
    2 book      Harry Potter           en 2005 29.99 CHILDREN
    3 book XQuery Kick Start           en 2003 49.99      WEB
    4 book      Learning XML           en 2003 39.95      WEB
Here's what it looks like with author included. You need to use ldply in this instance since the list is "jagged"; lapply can't handle that properly. [Otherwise you can use lapply with rbind.fill (also courtesy of Hadley), but why bother when plyr automatically does it for you?]:
    ldply(xmlToList(books), data.frame)

       .id        title.text title..attrs              author year price   .attrs
    1 book  Everyday Italian           en Giada De Laurentiis 2005 30.00  COOKING
    2 book      Harry Potter           en        J K. Rowling 2005 29.99 CHILDREN
    3 book XQuery Kick Start           en      James McGovern 2003 49.99      WEB
    4 book      Learning XML           en         Erik T. Ray 2003 39.95      WEB
         author.1   author.2   author.3               author.4
    1        <NA>       <NA>       <NA>                   <NA>
    2        <NA>       <NA>       <NA>                   <NA>
    3 Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan
    4        <NA>       <NA>       <NA>                   <NA>