To do some analysis I want to import a XML to a dataframe using R and the XML package. Example of XML file:
<watchers shop_name="TEST" created_at="September 14, 2012 05:44">
<watcher channel="Site Name">
<code>123456</code>
<search_key>TestKey</search_key>
<date>September 14, 2012 04:15</date>
<result>Found</result>
<link>http://www.test.com/fakeurl</link>
<price>100.0</price>
<shipping>0.0</shipping>
<origposition>0</origposition>
<name>Name Test</name>
<results>
<result position="1">
<c_name>CTest1</c_name>
<c_price>599.49</c_price>
<c_shipping>0.0</c_shipping>
<c_total_price>599.49</c_total_price>
<c_rating>8.3</c_rating>
<c_delivery/>
</result><result position="2">
<c_name>CTest2</c_name>
<c_price>654.0</c_price>
<c_shipping>0.0</c_shipping>
<c_total_price>654.0</c_total_price>
<c_rating>9.8</c_rating>
<c_delivery/>
</result>
<result position="3">
<c_name>CTest3</c_name>
<c_price>654.0</c_price>
<c_shipping>0.0</c_shipping>
<c_total_price>654.0</c_total_price>
<c_rating>8.8</c_rating>
<c_delivery/>
</result>
</results>
</watcher>
</watchers>
I want to have the rows of the dataframe containing the following fields:
shop_name created_at code search_key date result
link price shipping origposition name
position c_name c_price c_shipping c_total_price
c_rating c_delivery
This means that the child nodes must be taken into account as well, which would result in a dataframe of three rows in this example (since the results show 3 positions). The fields
shop_name created_at code search_key
date result link price shipping
origposition name
are the same for each of these rows.
I am able to go through the XML file, but I am unable to get a dataframe with the fields i want. When I convert the dataframe to a dataframe I get the following fields:
"code" "search_key" "date" "result"
"link" "price" "shipping" "origposition"
"name" "results"
Here the fields
shop_name created_at
are missing at the beginning and the 'results' are put together in a String under the column "results".
It must be possible to get the wanted dataframe, but I do not know how to do this exactly.
UPDATE
The solution provided by @MvG works brilliantly on the test XML file stated above. However the column 'result' can also have the value "Not Found". Entries with this value will miss certain fields (always the same filed) and therefore yield a "number of columns of arguments do not match"-error when running the solution. I would like these entries to be put in the dataframe as well, with the fields that are not present left empty. I do not understand how to incorporate this scenario.
test.xml
<watchers shop_name="TEST" created_at="September 14, 2012 05:44">
<watcher channel="Site Name">
<code>123456</code>
<search_key>TestKey</search_key>
<date>September 14, 2012 04:15</date>
<result>Found</result>
<link>http://www.test.com/fakeurl</link>
<price>100.0</price>
<shipping>0.0</shipping>
<origposition>0</origposition>
<name>Name Test</name>
<results>
<result position="1">
<c_name>CTest1</c_name>
<c_price>599.49</c_price>
<c_shipping>0.0</c_shipping>
<c_total_price>599.49</c_total_price>
<c_rating>8.3</c_rating>
<c_delivery/>
</result><result position="2">
<c_name>CTest2</c_name>
<c_price>654.0</c_price>
<c_shipping>0.0</c_shipping>
<c_total_price>654.0</c_total_price>
<c_rating>9.8</c_rating>
<c_delivery/>
</result>
<result position="3">
<c_name>CTest3</c_name>
<c_price>654.0</c_price>
<c_shipping>0.0</c_shipping>
<c_total_price>654.0</c_total_price>
<c_rating>8.8</c_rating>
<c_delivery/>
</result>
</results>
</watcher>
<watcher channel="Shopping">
<code>12804</code>
<search_key></search_key>
<date></date>
<result>Not found</result>
<link>https://www.test.com/testing1323p</link>
<price>0.0</price>
<shipping>0.0</shipping>
<origposition>0</origposition>
<name>MOOVM6002020</name>
<results>
</results>
</watcher>
</watchers>
Here is a more generic approach. Every node is classified as one of three cases:
rows
, then the data frames from child nodes will result in different rows of the result.cols
, then the data frames from child nodes will result in different columns of the result.value
, then a data frame with a single value will be constructed, using the node name as the column name and the node value as the column value.The call for your application is given towards the bottom.
library(XML)
zeroColSingleRow <- function() {
res <- data.frame(dummy=NA)
res$dummy <- NULL
stopifnot(nrow(res) == 1, ncol(res) == 0)
return (res)
}
xml2df <- function(node, classifier) {
if (! inherits(node, c("XMLInternalElementNode", "XMLElementNode"))) {
return (zeroColSingleRow())
}
kind <- classifier(node)
if (kind == "rows") {
cdf <- lapply(xmlChildren(node), xml2df, classifier)
if (length(cdf) == 0) {
res <- zeroColSingleRow()
}
else {
names <- unique(unlist(lapply(cdf, colnames)))
cdf <- lapply(cdf, function(i) {
missing <- setdiff(names, colnames(i))
if (length(missing) > 0) {
i[missing] <- NA
}
return (i)
})
res <- do.call(rbind, cdf)
}
}
else if (kind == "cols") {
cdf <- lapply(xmlChildren(node), xml2df, classifier)
if (length(cdf) == 0) {
res <- zeroColSingleRow()
}
else {
res <- cdf[[1]]
if (length(cdf) > 1) {
for (i in 2:length(cdf)) {
res <- merge(res, cdf[[i]], by=NULL)
}
}
}
}
else {
stopifnot(kind == "value")
res <- data.frame(xmlValue(node))
names(res) <- xmlName(node)
}
if (ncol(res) == 0) {
res <- zeroColSingleRow()
}
attr <- xmlAttrs(node)
if (length(attr) > 0) {
attr <- do.call(data.frame, as.list(attr))
res <- merge(attr, res, by=NULL)
}
rownames(res) <- NULL
return(res)
}
doc<-xmlParse("test.xml")
xml2df(xmlRoot(doc), function(node) {
name <- xmlName(node)
if (name %in% c("watchers", "results"))
return("rows")
# make sure to treat results/result different from watcher/result
if (name %in% c("watcher", "result") &&
xmlName(xmlParent(node)) == paste0(name, "s"))
return("cols")
return("value")
})
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With