I am trying to extract information from an XML file from ClinicalTrials.gov. The file is organized in the following way:
<clinical_study>
...
<brief_title>
...
<location>
<facility>
<name>
<address>
<city>
<state>
<zip>
<country>
</facility>
<status>
<contact>
<last_name>
<phone>
<email>
</contact>
</location>
<location>
...
</location>
...
</clinical_study>
I can use the R XML package from CRAN in the following code to extract all location nodes from the XML file:
library(XML)
clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes = TRUE)
locations <- xmlToDataFrame(getNodeSet(xmlDoc,"//location"))
This mostly works. However, if you look at the resulting data frame, you will notice that xmlToDataFrame lumps everything under <facility> into a single concatenated string. One solution would be to write code that builds the data frame column by column.
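To make the problem concrete without network access, here is a minimal sketch on a made-up two-location document. The site names and the reduced set of tags are assumptions for illustration only and are not taken from the real record.

```r
library(XML)

# Hypothetical two-location document, invented for illustration only.
# It is built without whitespace between tags so no blank text nodes are created.
xml_text <- paste0(
  "<clinical_study>",
  "<location>",
  "<facility><name>Site A</name>",
  "<address><city>Birmingham</city><state>Alabama</state></address></facility>",
  "<status>Recruiting</status>",
  "</location>",
  "<location>",
  "<facility><name>Site B</name>",
  "<address><city>Mobile</city><state>Alabama</state></address></facility>",
  "<status>Withdrawn</status>",
  "</location>",
  "</clinical_study>"
)

doc <- xmlParse(xml_text, asText = TRUE)

# Same call as in the question: one column per direct child of <location>
lumped <- xmlToDataFrame(getNodeSet(doc, "//location"))

# The nested <name>, <city> and <state> values are glued together in `facility`
print(lumped)
```

The facility column holds the text of all nested elements run together, which is the behavior described above.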
You could flatten the XML first.
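# Recursively collapse a node into a flat named list: one entry per leaf text node,
# named after the element that contains it.
flatten_xml <- function(x) {
  if (length(xmlChildren(x)) == 0) structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x)))
  else Reduce(append, lapply(xmlChildren(x), flatten_xml))
}
# One single-row data frame per <location>
dfs <- lapply(getNodeSet(xmlDoc, "//location"), function(x) data.frame(flatten_xml(x)))
# Union of all column names seen in any location
allnames <- unique(c(lapply(dfs, colnames), recursive = TRUE))
# Pad each data frame with NA for the columns it lacks, then stack them
df <- do.call(rbind, lapply(dfs, function(df) { df[, setdiff(allnames, colnames(df))] <- NA; df }))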
head(df)
# city state zip country status last_name phone email last_name.1
# 1 Birmingham Alabama 35294 United States Recruiting Louis B Nabors, MD 205-934-1813 [email protected] Louis B Nabors, MD
# 2 Mobile Alabama 36604 United States Recruiting Melanie Alford, RN 251-445-9649 [email protected] Pamela Francisco, CCRP
# 3 Phoenix Arizona 85013 United States Recruiting Lynn Ashby, MD 602-406-6262 [email protected] Lynn Ashby, MD
# 4 Tucson Arizona 85724 United States Recruiting Jamie Holt 520-626-6800 [email protected] Baldassarre Stea, MD, PhD
# 5 Little Rock Arkansas 72205 United States Recruiting Wilma Brooks, RN 501-686-8530 [email protected] Amanda Eubanks, APN
# 6 Berkeley California 94704 United States Withdrawn <NA> <NA> <NA> <NA>
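The flatten-and-pad idea can also be tried offline. The sketch below runs it on a small made-up document in which the second location has no <contact>. The document contents and the helper name pad are assumptions introduced here for illustration. The padding step is written with an explicit check for missing columns, which is a small variation on the one-liner above rather than the original code.

```r
library(XML)

# Hypothetical document, invented for illustration; the second location lacks <contact>.
xml_text <- paste0(
  "<clinical_study>",
  "<location>",
  "<facility><name>Site A</name>",
  "<address><city>Birmingham</city><state>Alabama</state></address></facility>",
  "<status>Recruiting</status>",
  "<contact><last_name>Jane Doe</last_name><phone>555-0100</phone></contact>",
  "</location>",
  "<location>",
  "<facility><name>Site B</name>",
  "<address><city>Mobile</city><state>Alabama</state></address></facility>",
  "<status>Withdrawn</status>",
  "</location>",
  "</clinical_study>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# Collapse a node into a flat named list, one entry per leaf text node
flatten_xml <- function(x) {
  if (length(xmlChildren(x)) == 0) {
    structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x)))
  } else {
    Reduce(append, lapply(xmlChildren(x), flatten_xml))
  }
}

# One single-row data frame per <location>
dfs <- lapply(getNodeSet(doc, "//location"),
              function(x) data.frame(flatten_xml(x), stringsAsFactors = FALSE))
allnames <- unique(unlist(lapply(dfs, colnames)))

# pad() is a hypothetical helper: add NA columns for names a data frame lacks
pad <- function(d) {
  missing_cols <- setdiff(allnames, colnames(d))
  if (length(missing_cols) > 0) d[missing_cols] <- NA
  d
}

flat <- do.call(rbind, lapply(dfs, pad))
print(flat)
```

Note that this approach names columns by leaf element only, so two leaves with the same tag inside one location get disambiguated by data.frame, as with last_name.1 in the output above.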
This answer converts the XML to a list, unlists each location section, transposes the section, converts the section to a data.table, and then uses rbindlist to merge all of the individual locations into one table. The fill=T argument matches the elements by name, and fills in missing element values with NA.
library(XML); library(data.table)
clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
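xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes = TRUE)
# Build one data.table row per node matched by `path`, then stack the rows by column name
xmlToDT <- function(doc, path) {
  rbindlist(
    lapply(getNodeSet(doc, path),
           # unlist() flattens the nested list and joins names with ".", t() makes it a single row
           function(x) data.table(t(unlist(xmlToList(x))))
    ), fill=T)
}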
locationDT <- xmlToDT(xmlDoc, "//location")
locationDT[1:6]
## facility.name facility.address.city facility.address.state facility.address.zip
## 1: "HYGEIA" Hospital Marousi District of Attica 151 23
## 2: Allina Health, Abbott Northwestern Hospital, John Nasseff Neuroscience Institute Minneapolis Minnesota 55407
## 3: Amrita Institute of Medical Sciences and Research Centre, Kochi Kochi Kerala 682 026
## 4: Anne Arundel Medical Center Annapolis Maryland 21401
## 5: Atlanta Cancer Care Atlanta Georgia 30005
## 6: Austin Health Heidelberg Victoria 3084
## facility.address.country
## 1: Greece
## 2: United States
## 3: India
## 4: United States
## 5: United States
## 6: Australia
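As a quick offline check of the name-based fill, here is a minimal sketch on a made-up document in which the second location has no <contact>. The document contents are assumptions for illustration, and the sketch requires the data.table package to be installed.

```r
library(XML)
library(data.table)

# Hypothetical document, invented for illustration; the second location lacks <contact>.
xml_text <- paste0(
  "<clinical_study>",
  "<location>",
  "<facility><name>Site A</name>",
  "<address><city>Birmingham</city><state>Alabama</state></address></facility>",
  "<status>Recruiting</status>",
  "<contact><last_name>Jane Doe</last_name><phone>555-0100</phone></contact>",
  "</location>",
  "<location>",
  "<facility><name>Site B</name>",
  "<address><city>Mobile</city><state>Alabama</state></address></facility>",
  "<status>Withdrawn</status>",
  "</location>",
  "</clinical_study>"
)
doc <- xmlParse(xml_text, asText = TRUE)

# One single-row data.table per <location>; nested names are joined with "."
rows <- lapply(getNodeSet(doc, "//location"), function(x) {
  flat <- unlist(xmlToList(x))
  data.table(t(flat))
})

# fill = TRUE aligns columns by name and inserts NA where a row has no value
dt <- rbindlist(rows, fill = TRUE)
print(names(dt))
print(dt)
```

Because the column names carry the full path (for example facility.address.city), leaves that share a tag name under different parents stay distinct.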