What is the fastest way to convert XML files to data frames in R currently?
The XML looks like this (note that not all rows have all fields):
<row>
<ID>001</ID>
<age>50</age>
<field3>blah</field3>
<field4 />
</row>
<row>
<ID>001</ID>
<age>50</age>
<field4 />
</row>
I have tried two approaches, xmlToDataFrame and xmlToDF.
For an 8.5 MB file with 1.6k "rows" and 114 "columns", xmlToDataFrame took 25.1 seconds, while xmlToDF took 16.7 seconds on my machine.
These times are quite large compared with Python XML parsers (e.g. xml.etree.ElementTree), which did the same job in 0.4 seconds.
Is there a faster way to do this in R, or is there something fundamental in R that prevents us making this faster?
Some light on this would be really helpful!
Updated for the comments
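library(XML)
# doc is assumed to be an already parsed document, e.g. the result of xmlParse()
d = xmlRoot(doc)
size = xmlSize(d)
# First pass: collect the union of field names across all rows
names = NULL
for(i in seq_len(size)){
  v = getChildrenStrings(d[[i]])
  names = unique(c(names, names(v)))
}
# Second pass: append one comma-separated line per row;
# fields missing from a row are written as NA
for(i in seq_len(size)){
  v = getChildrenStrings(d[[i]])
  cat(paste(v[names], collapse=","), "\n", file="a.csv", append=TRUE, sep="")
}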
This finishes in about 0.4 seconds for a 1000x100 XML record. If you know the variable names in advance, you can even omit the first for loop.
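As a rough illustration of the known-names case, the sketch below skips the first loop and reads the resulting file back into a data frame. The `<rows>` wrapper element, the second ID value, the `cols` vector and the temporary output file are assumptions made for this example, not part of the original data.

```r
library(XML)

# Sample input modelled on the question; the <rows> wrapper is an assumption.
xml_text <- paste0(
  "<rows>",
  "<row><ID>001</ID><age>50</age><field3>blah</field3><field4 /></row>",
  "<row><ID>002</ID><age>50</age><field4 /></row>",
  "</rows>"
)

d <- xmlRoot(xmlParse(xml_text, asText = TRUE))
size <- xmlSize(d)

# Column names are known up front, so no name-collecting pass is needed.
cols <- c("ID", "age", "field3", "field4")

# Write one comma-separated line per row; missing fields come out as NA.
out <- tempfile(fileext = ".csv")
for (i in seq_len(size)) {
  v <- getChildrenStrings(d[[i]])
  cat(paste(v[cols], collapse = ","), "\n", file = out, append = TRUE, sep = "")
}

# The file has no header line, so the column names are supplied here.
# colClasses = "character" keeps values such as "001" intact.
df <- read.csv(out, header = FALSE, col.names = cols,
               colClasses = "character", na.strings = "NA")
print(df)
```

Reading everything as character is a deliberate choice here so that leading zeros in ID survive; type.convert can be applied per column afterwards if numeric columns are wanted.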
Note: if your XML content contains commas or quotation marks, you have to handle them specially, because this method writes the values unquoted. In that case, I recommend the next method.
If you want to construct the data frame dynamically, you can do this with data.table. data.table is a little slower than the CSV method above, but faster than data.frame.
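library(data.table)
# Pre-allocate an all-character table with one column per field name
m = as.data.table(matrix(NA_character_, ncol=length(names), nrow=size))
setnames(m, names)
for(i in seq_len(size)){
  v = getChildrenStrings(d[[i]])
  # Parentheses make := treat names(v) as column names (with=FALSE is deprecated for this)
  m[i, (names(v)) := as.list(v)]
}
# Convert each column from character to its natural type
for (n in names) m[, (n) := type.convert(m[[n]], as.is=TRUE)]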
It finishes in about 1.1 seconds for the same document.
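To check timings like these on your own machine, something along the following lines could be used. It builds a synthetic 1000x100 document and times the data.table fill with system.time. The element names (`f1`, `f2`, ...), the `<rows>` wrapper and the integer values are made up for the benchmark, and `set()` is used here in place of `:=` inside the loop as a variation on the code above.

```r
library(XML)
library(data.table)

# Synthetic document: n_row <row> elements with n_col children each.
# Field names and values are invented for this benchmark.
n_row <- 1000
n_col <- 100
fields <- paste0("f", seq_len(n_col))
one_row <- function(i) {
  paste0("<row>",
         paste0("<", fields, ">", i, "</", fields, ">", collapse = ""),
         "</row>")
}
xml_text <- paste0(
  "<rows>",
  paste(vapply(seq_len(n_row), one_row, character(1)), collapse = ""),
  "</rows>"
)

d <- xmlRoot(xmlParse(xml_text, asText = TRUE))
size <- xmlSize(d)

elapsed <- system.time({
  # Pre-allocate an all-character table, then fill it row by row by reference.
  m <- as.data.table(matrix(NA_character_, nrow = size, ncol = n_col))
  setnames(m, fields)
  for (i in seq_len(size)) {
    v <- getChildrenStrings(d[[i]])
    set(m, i = i, j = names(v), value = as.list(v))
  }
  # Convert each column from character to its natural type.
  for (n in fields) set(m, j = n, value = type.convert(m[[n]], as.is = TRUE))
})
print(elapsed)
```

The absolute numbers depend on the machine and package versions, so they are only useful for comparing approaches run on the same setup.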