Load xml "rows" into R data table

Tags: dataframe, r, xml

I have some data shaped like this:

<people>
  <person first="Mary" last="Jane" sex="F" />
  <person first="Susan" last="Smith" sex="F" height="168" />
  <person last="Black" first="Joseph" sex="M" />
  <person first="Jessica" last="Jones" sex="F" />
</people>

I would like a data frame that looks like this:

    first  last sex height
1    Mary  Jane   F     NA
2   Susan Smith   F    168
3  Joseph Black   M     NA
4 Jessica Jones   F     NA

I've gotten this far:

library(XML)
xpeople <- xmlRoot(xmlParse(xml))
lst <- xmlApply(xpeople, xmlAttrs)
names(lst) <- 1:length(lst)

But I can't for the life of me figure out how to get the list into the data frame. I can get the list to be "square" (i.e. fill in the gaps) and then put it into a data frame:

lst <- xmlApply(xpeople, function(node) {
  attrs = xmlAttrs(node)
  if (!("height" %in% names(attrs))) {
    attrs[["height"]] <- NA
  }
  attrs
})
df = as.data.frame(lst)

But I have the following problems:

  1. The data frame is transposed
  2. first and last are Factors, not chr
  3. height is a Factor, not numeric
  4. the first and last names got swapped around for Joseph Black (not a big issue since my data is normally consistent, but annoying nonetheless)

How can I get the data frame in the correct form?

dwurf asked Oct 20 '22 01:10


1 Answer

txt <- '<people>
          <person first="Mary" last="Jane" sex="F" />
          <person first="Susan" last="Smith" sex="F" height="168" />
          <person last="Black" first="Joseph" sex="M" />
          <person first="Jessica" last="Jones" sex="F" />
        </people>'
library(XML)         # for xmlTreeParse
library(data.table)  # for rbindlist(...)
xml <- xmlTreeParse(txt, asText=TRUE, useInternalNodes = TRUE)
# one named list of attributes per <person> node
people <- lapply(xml["//person"], function(x) as.list(xmlAttrs(x)))
rbindlist(people, fill = TRUE)
#      first  last sex height
# 1:    Mary  Jane   F     NA
# 2:   Susan Smith   F    168
# 3:  Joseph Black   M     NA
# 4: Jessica Jones   F     NA

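You need as.list(xmlAttrs(...)) instead of just xmlAttrs(...) because rbindlist(...) expects each element of its input to be a list, not an atomic vector. With fill=TRUE, rbindlist(...) matches columns by name and fills missing ones with NA, so the swapped attribute order for Joseph Black is handled as well. Note that xmlAttrs(...) returns character values, so every column, including height, comes back as character.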
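The question also asks for height to be numeric. Since XML attributes are read as strings, one way to get there is to convert the column after binding. The following is a minimal sketch of that extra step, not part of the original answer; the names doc, rows and dt are chosen here for illustration, and it uses getNodeSet(...) in place of the xml["//person"] shorthand.

```r
library(XML)         # xmlParse, getNodeSet, xmlAttrs
library(data.table)  # rbindlist, :=

txt <- '<people>
          <person first="Mary" last="Jane" sex="F" />
          <person first="Susan" last="Smith" sex="F" height="168" />
          <person last="Black" first="Joseph" sex="M" />
          <person first="Jessica" last="Jones" sex="F" />
        </people>'

doc <- xmlParse(txt, asText = TRUE)

# One named list of attributes per <person> node
rows <- lapply(getNodeSet(doc, "//person"), function(node) as.list(xmlAttrs(node)))

# Columns are matched by name; missing attributes become NA
dt <- rbindlist(rows, fill = TRUE)

# Attributes arrive as character, so convert height explicitly (assumed to be the only numeric column)
dt[, height := as.numeric(height)]

str(dt)
```

If a plain data.frame is preferred over a data.table, as.data.frame(dt) should convert it while keeping the column types.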

jlhoward answered Oct 22 '22 01:10