I have a bunch of XML files and an R script that reads their content into a data frame. However, I got now files which I wanted to parse as usual, but there is something in their namespace definition that doesn't allow me to pick their values normally with XPath expressions.
XML files are like this:
xml_nons.xml
<?xml version="1.0" encoding="UTF-8"?>
<XML>
<Node>
<Name>Name 1</Name>
<Title>Title 1</Title>
<Date>2015</Date>
</Node>
</XML>
And the other:
xml_ns.xml
<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
<Node>
<Name>Name 2</Name>
<Title>Title 2</Title>
<Date>2014</Date>
</Node>
</XML>
The URL where xmlns points to doesn't exist.
The R code I use is like this:
library(XML)
xmlfiles <- list.files(path = ".",
pattern="*.xml$",
full.names = TRUE,
recursive = TRUE)
n <- length(xmlfiles)
dat <- vector("list", n)
for(i in 1:n){
doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
nodes <- getNodeSet(doc, "//XML")
x <- lapply(nodes, function(x){ data.frame(
Filename = xmlfiles[i],
Name = xpathSApply(x, ".//Node/Name" , xmlValue),
Title = xpathSApply(x, ".//Node/Title" , xmlValue),
Date = xpathSApply(x, ".//Node/Date" , xmlValue)
)})
dat[[i]] <- do.call("rbind", x)
}
xml <- do.call("rbind", dat)
xml
However, what I get as a result is:
Filename Name Title Date
./xml_nons.xml Name 1 Title 1 2015
If I remove the namespace link from the second file I get correct:
Filename Name Title Date
./xml_nons_1.xml Name 1 Title 1 2015
./xml_ns_1.xml Name 2 Title 2 2014
Of course I could have an XSL to remove those namespaces from original XML files, but I would like to have some solution that works within R. Is there some way to tell R just to ignore everything in the XML declaration?
When you use multiple namespaces in an XML document, you can define one namespace as the default namespace to create a cleaner looking document. The default namespace is declared in the root element and applies to all unqualified elements in the document. Default namespaces apply to elements only, not to attributes.
An XML namespace is a collection of names that can be used as element or attribute names in an XML document. The namespace qualifies element names uniquely on the Web in order to avoid conflicts between elements with the same name.
An XML file can be read in R using the function xmlParse() . Then, load data is stored in a list. An XML file can also be read in the form of a data frame by using the xmlToDataFrame() method.
One of the primary motivations for defining an XML namespace is to avoid naming conflicts when using and re-using multiple vocabularies. XML Schema is used to create a vocabulary for an XML instance, and uses namespaces heavily.
I think there is no easy way to ignore the namespaces. The best way is to learn to live with them. This answer will use the newer XML2 package. But the same applies to the XML package solution.
Use
library(XML2)
fname='myfile.xml'
doc <- read_xml(fname)
#peak at the namespaces
xml_ns(doc)
The first namespace is assigned to d1. If you XPath does not find what you want, the most likely cause is the namespace issue.
xpath <- "//d1:FormDef"
ns <- xml_find_all(doc,xpath, xml_ns(doc))
ns
Also, you have to do this for every element in the path So to save typing, you can do
library(stringr)
> xpath <- "/ODM/Study"
> (xpath<-str_replace_all(xpath,'/','/d1:'))
[1] "/d1:ODM/d1:Study"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With