Consider the following XML example:
library(xml2)
myxml <- read_xml('
<data>
<obs ID="a">
<name> John </name>
<hobby> tennis </hobby>
<hobby> golf </hobby>
<skill> python </skill>
</obs>
<obs ID="b">
<name> Robert </name>
<skill> R </skill>
</obs>
</data>
')
Here I would like to get an (R or Pandas) dataframe from this XML that contains the columns name and hobby.
However, as you can see, there is an alignment problem because hobby is missing in the second node and John has two hobbies.
In R, I know how to extract specific values one at a time, for instance using xml2 as follows:
myxml%>%
xml_find_all("//name") %>%
xml_text()
myxml%>%
xml_find_all("//hobby") %>%
xml_text()
But how can I align this data correctly in a dataframe? That is, how can I obtain a dataframe as follows (note how I join the two hobbies of John with a |):
# A tibble: 2 × 3
name hobby skill
<chr> <chr> <chr>
1 John tennis|golf python
2 Robert <NA> R
In R, I would prefer a solution using xml2 and dplyr. In Python, I want to end up with a Pandas dataframe. Also, in my XML there are many more variables I want to parse. I would like a solution that allows the user to parse additional variables without changing the code too much.
Thanks!
EDIT: thanks to everyone for these great solutions. All of them were really nice, with plenty of details, and it was hard to pick the best one. Thanks again!
A general R solution that does not require hardcoding the variables, using xml2 and tidyverse's purrr:
library(xml2)
library(purrr)
myxml %>%
xml_find_all('obs') %>%
# Enter each obs and return a df
map_df(~{
# Scan names
node_names <- .x %>%
xml_children() %>%
xml_name() %>%
unique()
# Remember ob
ob <- .x
# Enter each node
map(node_names, ~{
# Find similar nodes
node <- xml_find_all(ob, .x) %>%
xml_text(trim = TRUE) %>%
paste0(collapse = '|') %>%
'names<-'(.x)
# ^ we need to name the element to
# overwrite it with its 'siblings'
}) %>%
# Return an 'ob' vector
flatten()
})
#> # A tibble: 2 × 3
#> name hobby skill
#> <chr> <chr> <chr>
#> 1 John tennis|golf python
#> 2 Robert <NA> R
Explanation:
For each obs, find and store the node names in that obs.
For each node name, find the matching nodes in that obs, collapse them and store the result in a list.
rbind (implicit in map_df()) each flattened list into the resulting data.frame.
Data:
myxml <- read_xml('
<data>
<obs ID="a">
<name> John </name>
<hobby> tennis </hobby>
<hobby> golf </hobby>
<skill> python </skill>
</obs>
<obs ID="b">
<name> Robert </name>
<skill> R </skill>
</obs>
</data>
')
pandas
import pandas as pd
from collections import defaultdict
import xml.etree.ElementTree as ET
xml_txt = """<data>
<obs ID="a">
<name> John </name>
<hobby> tennis </hobby>
<hobby> golf </hobby>
<skill> python </skill>
</obs>
<obs ID="b">
<name> Robert </name>
<skill> R </skill>
</obs>
</data>"""
etree = ET.fromstring(xml_txt)
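def obs2series(o):
    # Collect the text of every child tag; repeated tags share one list
    d = defaultdict(list)
    for c in o:  # Element.getchildren() was removed in Python 3.9
        d[c.tag].append((c.text or '').strip())
    # Join each list into a single '|'-separated string
    return pd.Series(d).str.join('|')

pd.DataFrame([obs2series(o) for o in etree.findall('obs')])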
hobby name skill
0 tennis|golf John python
1 NaN Robert R
How It Works
The tree is built from a string here; if the XML lives in a file, it can be parsed with et = ET.parse('my_data.xml').
etree.findall('obs') returns a list of elements within the xml structure that are 'obs' tags.
Each of these elements is passed to obs2series, which builds a pd.Series.
Within obs2series I loop through all child nodes in one 'obs' element.
defaultdict defaults to a list, meaning I can append to a value even if the key hasn't been seen before.
The resulting dictionary of lists is passed to pd.Series to get a series of lists.
With pd.Series.str.join('|') I convert this to a series of strings as I wanted.
The list of series is finally passed to the pd.DataFrame constructor.

Create a function that can handle missing or multiple nodes, and then apply that to the obs nodes. I added the id column so you can see how to use xmlGetAttr too (use "." for the obs node itself and a leading "." on the other paths so they are relative to the current node in the set).
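library(XML)
# doc is assumed to be the document already parsed with XML::xmlParse()
xpath2 <- function(x, ...){
  y <- xpathSApply(x, ...)
  # NA when nothing matches, otherwise the trimmed values joined together
  ifelse(length(y) == 0, NA, paste(trimws(y), collapse=", "))
}
obs <- getNodeSet(doc, "//obs")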
data.frame( id = sapply(obs, xpath2, ".", xmlGetAttr, "ID"),
name = sapply(obs, xpath2, ".//name", xmlValue),
hobbies = sapply(obs, xpath2, ".//hobby", xmlValue),
skill = sapply(obs, xpath2, ".//skill", xmlValue))
id name hobbies skill
1 a John tennis, golf python
2 b Robert <NA> R
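For the Python side of the question, the same pattern (a helper that returns a missing value when nothing matches and joins multiple matches, applied to each obs with relative paths) could be sketched roughly as below. This is only an illustration using the standard library: join_or_none and obs_to_rows are hypothetical names, and the column list is an assumption the caller supplies.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<data>
<obs ID="a">
<name> John </name>
<hobby> tennis </hobby>
<hobby> golf </hobby>
<skill> python </skill>
</obs>
<obs ID="b">
<name> Robert </name>
<skill> R </skill>
</obs>
</data>"""


def join_or_none(node, path, sep=", "):
    """Join the trimmed text of every match of `path` under `node`.

    Returns None when nothing matches, so a missing tag stays missing.
    (Hypothetical helper, named here for illustration only.)
    """
    values = [(el.text or "").strip() for el in node.findall(path)]
    return sep.join(values) if values else None


def obs_to_rows(xml_text, columns):
    """Build one dict per <obs>: its ID attribute plus the requested columns."""
    root = ET.fromstring(xml_text)
    rows = []
    for obs in root.iter("obs"):
        row = {"id": obs.get("ID")}
        for col in columns:
            # The leading ".//" keeps the search relative to the current obs
            row[col] = join_or_none(obs, ".//" + col)
        rows.append(row)
    return rows


rows = obs_to_rows(XML_TEXT, ["name", "hobby", "skill"])
for row in rows:
    print(row)
# Passing `rows` to pandas.DataFrame would give one row per obs
```

Adding another variable then only means adding its tag name to the column list.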
I don't use xml2 very often, but maybe get the obs nodes and then apply xml_find_all where there are duplicate tags, instead of using xml_find_first.
obs <- xml_find_all(myxml, "//obs")
lapply(obs, xml_find_all, ".//hobby")
tibble(  # data_frame() is deprecated in favour of tibble()
name = xml_find_first(obs, ".//name") %>% xml_text(trim=TRUE),
hobbies = sapply(obs, function(x) paste(xml_text( xml_find_all(x, ".//hobby"), trim=TRUE), collapse=", " ) ),
skill = xml_find_first(obs, ".//skill") %>% xml_text(trim=TRUE)
)
# A tibble: 2 x 3
name hobbies skill
<chr> <chr> <chr>
1 John tennis, golf python
2 Robert R
I tested both methods using the medline17n0853.xml file from the NCBI FTP site. This is a 280 MB file with 30,000 PubmedArticle nodes. The XML package took 102 seconds to parse the PubMed IDs and journals and to combine multiple publication types. The xml2 code ran for 30 minutes before I killed it, so that approach may not be the best choice for files of this size.
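For files of that size on the Python side, one option that may be worth trying is a streaming parse, which handles one record at a time instead of loading the whole tree. The sketch below uses xml.etree.ElementTree.iterparse on the small sample from the question; stream_obs is a hypothetical name, the record tag "obs" is an assumption, and no timings were measured for it. A nested format such as PubmedArticle would need deeper paths than the direct children read here.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def stream_obs(source, record_tag="obs", sep="|"):
    """Yield one dict per record element while parsing incrementally.

    `source` is a filename or a binary file object. Only the direct
    children of each record are read (an assumption of this sketch).
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        values = defaultdict(list)
        for child in elem:
            values[child.tag].append((child.text or "").strip())
        yield {tag: sep.join(texts) for tag, texts in values.items()}
        # Release the record's children once it has been converted
        elem.clear()


sample = io.BytesIO(b"""<data>
<obs ID="a">
<name> John </name>
<hobby> tennis </hobby>
<hobby> golf </hobby>
<skill> python </skill>
</obs>
<obs ID="b">
<name> Robert </name>
<skill> R </skill>
</obs>
</data>""")

records = list(stream_obs(sample))
for record in records:
    print(record)
```

Tags that are absent from a record are simply missing from its dict, so building a dataframe from the records would leave those cells empty.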
In R, I'd probably use
library(XML)
lst <- xmlToList(xmlParse(myxml)[['/data']])
(df <- data.frame(t(sapply(lst, function(x) {
c(x['name'], hobby=paste0(x[which(names(x)=='hobby')], collapse="|"))
}))) )
# name hobby
# 1 John tennis | golf
# 2 Robert
and maybe do some polishing using df[df==""] <- NA and trimws() to remove whitespace.
Or:
library(xml2)
library(dplyr)
`%|||%` <- function (x, y) if (length(x)==0) y else x
(df <- tibble(  # data_frame() is deprecated in favour of tibble()
names = myxml %>%
xml_find_all("/data/obs/name") %>%
xml_text(trim=TRUE),
hobbies = myxml %>%
xml_find_all("/data/obs") %>%
lapply(function(x) xml_text(xml_find_all(x, "hobby"), trim=TRUE) %|||% NA_character_)
))
# # A tibble: 2 × 2
# names hobbies
# <chr> <list>
# 1 John <chr [2]>
# 2 Robert <chr [1]>
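The list column above keeps each hobby as a separate value rather than a joined string. A rough Python counterpart, assuming a long layout is acceptable, is one row per (obs, tag, value), which sidesteps the alignment problem because missing tags produce no row at all. obs_to_long is a hypothetical name used only for this sketch.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<data>
<obs ID="a">
<name> John </name>
<hobby> tennis </hobby>
<hobby> golf </hobby>
<skill> python </skill>
</obs>
<obs ID="b">
<name> Robert </name>
<skill> R </skill>
</obs>
</data>"""


def obs_to_long(xml_text):
    """Return one dict per child node: the obs ID, the tag and the trimmed text."""
    root = ET.fromstring(xml_text)
    rows = []
    for obs in root.findall("obs"):
        for child in obs:
            rows.append({
                "id": obs.get("ID"),
                "variable": child.tag,
                "value": (child.text or "").strip(),
            })
    return rows


long_rows = obs_to_long(XML_TEXT)
for row in long_rows:
    print(row)
# pandas.DataFrame(long_rows) would give a long table that can be
# grouped by id and variable if a wide layout is needed later
```

New variables in the XML show up as new values in the variable field without any change to the code.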