Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing XML in R: Incorrect namespaces

I have a bunch of XML files and an R script that reads their content into a data frame. However, I got now files which I wanted to parse as usual, but there is something in their namespace definition that doesn't allow me to pick their values normally with XPath expressions.

XML files are like this:

xml_nons.xml

<?xml version="1.0" encoding="UTF-8"?>
<XML>
   <Node>
      <Name>Name 1</Name>
      <Title>Title 1</Title>
      <Date>2015</Date>
   </Node>
</XML>

And the other:

xml_ns.xml

<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
   <Node>
      <Name>Name 2</Name>
      <Title>Title 2</Title>
      <Date>2014</Date>
   </Node>
</XML>

The URL where xmlns points to doesn't exist.

The R code I use is like this:

library(XML)

xmlfiles <- list.files(path = ".", 
                       pattern="*.xml$", 
                       full.names = TRUE, 
                       recursive = TRUE)

n <- length(xmlfiles)
dat <- vector("list", n)

for(i in 1:n){
       doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
       nodes <- getNodeSet(doc, "//XML")
       x <- lapply(nodes, function(x){ data.frame(
              Filename = xmlfiles[i],
              Name = xpathSApply(x, ".//Node/Name" , xmlValue),
              Title = xpathSApply(x, ".//Node/Title" , xmlValue),
              Date = xpathSApply(x, ".//Node/Date" , xmlValue)
            )})
            dat[[i]] <- do.call("rbind", x)
    }

    xml <- do.call("rbind", dat)
    xml

However, what I get as a result is:

Filename            Name    Title    Date
./xml_nons.xml      Name 1  Title 1  2015

If I remove the namespace link from the second file I get correct:

Filename            Name    Title    Date
./xml_nons_1.xml    Name 1  Title 1  2015
./xml_ns_1.xml      Name 2  Title 2  2014

Of course I could have an XSL to remove those namespaces from original XML files, but I would like to have some solution that works within R. Is there some way to tell R just to ignore everything in the XML declaration?

like image 408
nikopartanen Avatar asked Mar 20 '15 15:03

nikopartanen


People also ask

Can XML have multiple namespaces?

When you use multiple namespaces in an XML document, you can define one namespace as the default namespace to create a cleaner looking document. The default namespace is declared in the root element and applies to all unqualified elements in the document. Default namespaces apply to elements only, not to attributes.

Can XML attributes have namespaces?

An XML namespace is a collection of names that can be used as element or attribute names in an XML document. The namespace qualifies element names uniquely on the Web in order to avoid conflicts between elements with the same name.

How do I read an XML file in R?

An XML file can be read in R using the function xmlParse() . Then, load data is stored in a list. An XML file can also be read in the form of a data frame by using the xmlToDataFrame() method.

Why is namespace important in XML?

One of the primary motivations for defining an XML namespace is to avoid naming conflicts when using and re-using multiple vocabularies. XML Schema is used to create a vocabulary for an XML instance, and uses namespaces heavily.


1 Answers

I think there is no easy way to ignore the namespaces. The best way is to learn to live with them. This answer will use the newer XML2 package. But the same applies to the XML package solution.

Use

library(XML2)
fname='myfile.xml'
doc <- read_xml(fname)
#peak at the namespaces
xml_ns(doc)

The first namespace is assigned to d1. If you XPath does not find what you want, the most likely cause is the namespace issue.

xpath <-  "//d1:FormDef"
ns <- xml_find_all(doc,xpath, xml_ns(doc))
ns

Also, you have to do this for every element in the path So to save typing, you can do

library(stringr)
> xpath <-  "/ODM/Study"
> (xpath<-str_replace_all(xpath,'/','/d1:'))
[1] "/d1:ODM/d1:Study"
like image 150
userJT Avatar answered Nov 02 '22 22:11

userJT