R Fast XML Parsing

Question

What is the fastest way to convert XML files to data frames in R currently?

The XML looks like this (note: not all rows have all fields):

  <row>
    <ID>001</ID>
    <age>50</age>
    <field3>blah</field3>
    <field4 />
  </row>
  <row>
    <ID>001</ID>
    <age>50</age>
    <field4 />
  </row>

I have tried two approaches:

The xmlToDataFrame function from the XML package
The speed-oriented xmlToDF function posted here

For an 8.5 MB file, with 1.6k "rows" and 114 "columns", xmlToDataFrame took 25.1 seconds, while xmlToDF took 16.7 seconds on my machine.

These times are quite long compared with Python XML parsers (e.g. xml.etree.ElementTree), which did the same job in 0.4 seconds.

Is there a faster way to do this in R, or is there something fundamental in R that prevents us making this faster?

Some light on this would be really helpful!

Randy Lai · Accepted Answer

Updated for the comments

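library(XML)

# "data.xml" is a placeholder path; doc must be the parsed XML document
doc = xmlParse("data.xml")
d = xmlRoot(doc)
size = xmlSize(d)

# First pass: collect the union of field names across all rows
names = NULL
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    names = unique(c(names, names(v)))
}

# Second pass: append one comma-separated line per row to a.csv
# (fields missing from a row are written as NA; sep="" avoids a trailing space)
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    cat(paste(v[names], collapse=","), "\n", sep="", file="a.csv", append=TRUE)
}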

This finishes in about 0.4 seconds for a 1000x100 XML record. If you know the variable names in advance, you can even omit the first for loop.
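The loop only writes a.csv; to get the data frame the question asks for, the file still has to be read back. A minimal sketch of that step, assuming the file has no header line and missing fields were written as NA. The toy file and the column names below are stand-ins, not the real data:

```r
# Toy stand-in for a.csv: no header line, NA where a row lacks a field
names <- c("ID", "age", "field3", "field4")
path <- tempfile(fileext = ".csv")
writeLines(c("001,50,blah,", "001,50,NA,"), path)

# Supply the column names collected in the first pass; read ID as character
# so identifiers such as 001 keep their leading zeros
df <- read.csv(path, header = FALSE, col.names = names,
               colClasses = c(ID = "character"))
str(df)
```

Columns not listed in colClasses are left to read.csv's own type detection.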

Note: if your XML content contains commas or quotation marks, you will have to take special care of them. In that case, I recommend the next method.
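For readers who still want the CSV route, one way to take that care is to quote every field and double any embedded quotation marks before writing the line. This is only a sketch: csv_line is a hypothetical helper written for this example, not a function from the XML package.

```r
# Hypothetical helper: turn one row of fields into a CSV line, quoting each
# field and doubling embedded quotation marks; missing fields stay as bare NA
csv_line <- function(fields) {
    quoted <- ifelse(is.na(fields),
                     "NA",
                     paste0('"', gsub('"', '""', fields, fixed = TRUE), '"'))
    paste(quoted, collapse = ",")
}

row <- c(ID = "001", note = 'said "hi", then left', field4 = NA)
line <- csv_line(row)
cat(line, "\n", sep = "")

# Round trip through read.csv to check that the fields come back intact
back <- read.csv(text = line, header = FALSE, colClasses = "character")
```

In the loop above, csv_line(v[names]) would take the place of paste(v[names], collapse=","). The extra string work per row will cost some of the speed advantage.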

If you want to construct the data.frame dynamically, you can do this with data.table. data.table is a little slower than the CSV method above, but faster than data.frame.

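library(data.table)

# Preallocate an all-character table: one column per field, one row per record
m = data.table(matrix(NA_character_, ncol=length(names), nrow=size))
setnames(m, names)
# Fill each row by reference; fields absent from a row stay NA
# (the parenthesized LHS replaces the deprecated with=FALSE form)
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    m[i, (names(v)) := as.list(v)]
}
# Convert each column from character to its natural type
for (n in names) m[, (n) := type.convert(m[[n]], as.is=TRUE)]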

It finishes in about 1.1 seconds for the same document.
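To see how this handles rows with missing fields, here is the same approach run end to end on the two sample rows from the question. It is a sketch under one assumption: the rows are wrapped in a made-up <rows> root element, since the question does not show the real root.

```r
library(XML)
library(data.table)

# The question's sample rows inside a hypothetical <rows> root element
txt <- "<rows>
  <row><ID>001</ID><age>50</age><field3>blah</field3><field4 /></row>
  <row><ID>001</ID><age>50</age><field4 /></row>
</rows>"
d <- xmlRoot(xmlParse(txt, asText = TRUE))
size <- xmlSize(d)

# Union of field names across all rows
names <- NULL
for (i in seq_len(size)) {
    names <- unique(c(names, names(getChildrenStrings(d[[i]]))))
}

# Preallocate a character table, then fill it row by row by reference
m <- data.table(matrix(NA_character_, ncol = length(names), nrow = size))
setnames(m, names)
for (i in seq_len(size)) {
    v <- getChildrenStrings(d[[i]])
    m[i, (names(v)) := as.list(v)]
}
for (n in names) m[, (n) := type.convert(m[[n]], as.is = TRUE)]
print(m)
```

The second row has no field3, so that cell stays NA. Note that type.convert reads values such as 001 as integers and drops the leading zeros; leave identifier columns out of the final conversion loop if that matters.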

Tags: r, xml

