
R Fast XML Parsing

Tags: r, xml

What is the fastest way to convert XML files to data frames in R currently?

The XML looks like this (note that not all rows have all fields):

  <row>
    <ID>001</ID>
    <age>50</age>
    <field3>blah</field3>
    <field4 />
  </row>
  <row>
    <ID>001</ID>
    <age>50</age>
    <field4 />
  </row>

I have tried two approaches:

  1. The xmlToDataFrame function from the XML library
  2. The speed oriented xmlToDF function posted here

For an 8.5 MB file, with 1.6k "rows" and 114 "columns", xmlToDataFrame took 25.1 seconds, while xmlToDF took 16.7 seconds on my machine.

These times are quite long compared with Python XML parsers (e.g. xml.etree.ElementTree), which did the same job in 0.4 seconds.

Is there a faster way to do this in R, or is there something fundamental in R that prevents us from making this faster?

Some light on this would be really helpful!

user997943 asked Apr 06 '14 01:04



1 Answer

Updated for the comments

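library(XML)

# doc is assumed to be the already parsed document, e.g. the result of xmlParse() on your file
d = xmlRoot(doc)
size = xmlSize(d)

# First pass: collect the union of field names across all rows
names = NULL
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    names = unique(c(names, names(v)))
}

# Second pass: append one comma-separated line per row; fields missing from a row are written as NA
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    cat(paste(v[names], collapse=","), "\n", sep="", file="a.csv", append=TRUE)
}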

This finishes in about 0.4 seconds for an XML document with 1000 records and 100 fields. If you know the variable names in advance, you can even omit the first for loop.
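Since the goal is a data frame, the CSV still has to be read back in. The following is a minimal sketch of the whole round trip on a tiny inline sample modelled on the question. The `<rows>` wrapper element, the variable names `fields` and `csv_path`, and the use of a temporary file are assumptions for illustration. It also writes through a connection opened once instead of appending with cat on every row, which is a small variation on the loop above.

```r
library(XML)

# Tiny sample modelled on the question; the <rows> wrapper is an assumption
xml_text <- paste0(
  "<rows>",
  "<row><ID>001</ID><age>50</age><field3>blah</field3><field4/></row>",
  "<row><ID>002</ID><age>41</age><field4/></row>",
  "</rows>")

d <- xmlRoot(xmlParse(xml_text, asText = TRUE))
size <- xmlSize(d)

# Pass 1: union of field names across all rows
fields <- NULL
for (i in seq_len(size)) {
  fields <- unique(c(fields, names(getChildrenStrings(d[[i]]))))
}

# Pass 2: write one line per row through a connection opened once
csv_path <- tempfile(fileext = ".csv")
con <- file(csv_path, open = "w")
for (i in seq_len(size)) {
  v <- getChildrenStrings(d[[i]])
  writeLines(paste(v[fields], collapse = ","), con)
}
close(con)

# Read the file back; there is no header line, so supply the names.
# Everything is kept as character here so that IDs such as "001" survive.
df <- read.csv(csv_path, header = FALSE, col.names = fields,
               colClasses = "character")
```

Drop `colClasses = "character"` if you would rather let read.csv guess the column types, keeping in mind that an ID like 001 then becomes the number 1.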

Note: if your XML content contains commas or quotation marks, you will have to handle them specially. In that case, I recommend the next method.
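If you do want to stay with the CSV route, one way to take that special care is to quote every field yourself. This is only a sketch in base R, assuming the usual CSV convention of wrapping fields in double quotes and doubling any embedded quotes. `csv_row` is a hypothetical helper, not part of the XML package, and it writes missing values as empty fields.

```r
# Hypothetical helper: turn a character vector into one quoted CSV line
csv_row <- function(values) {
  values[is.na(values)] <- ""                      # missing fields become empty
  escaped <- gsub('"', '""', values, fixed = TRUE) # double embedded quotes
  paste0('"', escaped, '"', collapse = ",")        # wrap each field and join
}

line <- csv_row(c("001", "Smith, John", 'said "hi"', NA))

# Round trip through read.csv to check that the fields come back intact
parsed <- read.csv(text = line, header = FALSE, colClasses = "character")
```

In the loop above you would then write `csv_row(v[names])` instead of the plain `paste(...)` call.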


If you want to construct the data.frame dynamically, you can do this with data.table. It is a little slower than the CSV method above, but faster than building a data.frame.

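library(data.table)

# Preallocate an all-character table with one column per field
m = as.data.table(matrix(NA_character_, ncol=length(names), nrow=size))
setnames(m, names)
# Fill row by row; fields absent from a row stay NA
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    m[i, (names(v)) := as.list(v)]
}
# Convert each column from character to its natural type
for (n in names) m[, (n) := type.convert(m[[n]], as.is=TRUE)]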

It finishes in about 1.1 seconds for the same document.
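A variation you could also try, and which is not the method timed above, is to collect each row as a named list and let data.table's rbindlist line up the columns with fill=TRUE. The sketch below runs on a tiny inline sample modelled on the question; the `<rows>` wrapper element is an assumption, and I have not benchmarked this against the loops above.

```r
library(XML)
library(data.table)

# Tiny sample modelled on the question; the <rows> wrapper is an assumption
xml_text <- paste0(
  "<rows>",
  "<row><ID>001</ID><age>50</age><field3>blah</field3><field4/></row>",
  "<row><ID>002</ID><age>41</age><field4/></row>",
  "</rows>")
d <- xmlRoot(xmlParse(xml_text, asText = TRUE))

# One named list per row; rows may have different sets of fields
rows <- lapply(seq_len(xmlSize(d)),
               function(i) as.list(getChildrenStrings(d[[i]])))

# fill = TRUE matches columns by name and pads missing fields with NA
m <- rbindlist(rows, fill = TRUE)

# Convert each column from character to its natural type.
# Note that this turns an ID such as "001" into the integer 1.
for (n in names(m)) set(m, j = n, value = type.convert(m[[n]], as.is = TRUE))
```

This skips the separate pass that collects the field names, because rbindlist works out the union of names itself.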

Randy Lai answered Sep 29 '22 14:09
