R: convert XML data to data frame

Tags:

dataframe

r

xml

For a homework assignment I am attempting to convert an XML file into a data frame in R. I have tried many different things, and I have searched for ideas on the internet but have been unsuccessful. Here is my code so far:

library(XML)

url <- 'http://www.ggobi.org/book/data/olive.xml'
doc <- xmlParse(url)
root <- xmlRoot(doc)

dataFrame <- xmlSApply(root, function(x) xmlSApply(x, xmlValue))
data.frame(t(dataFrame), row.names=NULL)

The output I get is like a giant vector of numbers. I am attempting to organize the data into a data frame, but I do not know how to properly adjust my code to obtain that.

asked Oct 31 '15 00:10

mapleleaf

2 Answers

It may not be as verbose as the XML package, but xml2 doesn't have the memory leaks and is laser-focused on data extraction. I use trimws(), which is a fairly recent addition to base R (it arrived in R 3.2.0).

library(xml2)

pg <- read_xml("http://www.ggobi.org/book/data/olive.xml")

# get all the <record>s
recs <- xml_find_all(pg, "//record")

# extract and clean all the columns
vals <- trimws(xml_text(recs))

# extract and clean (if needed) the area names
labs <- trimws(xml_attr(recs, "label"))

# mine the column names from the two variable descriptions
# this XPath construct lets us grab either the <categ…> or <real…> tags
# and then grabs the 'name' attribute of them
cols <- xml_attr(xml_find_all(pg, "//data/variables/*[self::categoricalvariable or
                                                      self::realvariable]"), "name")

# this converts each set of <record> columns to a data frame
# after first converting each row to numeric and assigning
# names to each column (making it easier to do the matrix to data frame conv)
# (the literal "na" in the first record becomes NA, with a coercion warning)
dat <- do.call(rbind, lapply(strsplit(vals, " +"),
                             function(x) {
                               data.frame(rbind(setNames(as.numeric(x), cols)))
                             }))

# then assign the area name column to the data frame
dat$area_name <- labs

head(dat)
##   region area palmitic palmitoleic stearic oleic linoleic linolenic
## 1      1    1     1075          75     226  7823      672        NA
## 2      1    1     1088          73     224  7709      781        31
## 3      1    1      911          54     246  8113      549        31
## 4      1    1      966          57     240  7952      619        50
## 5      1    1     1051          67     259  7771      672        50
## 6      1    1      911          49     268  7924      678        51
##   arachidic eicosenoic    area_name
## 1        60         29 North-Apulia
## 2        61         29 North-Apulia
## 3        63         29 North-Apulia
## 4        78         35 North-Apulia
## 5        80         46 North-Apulia
## 6        70         44 North-Apulia

UPDATE

I'd probably do the last bit this way now:

library(tidyverse)

strsplit(vals, "[[:space:]]+") %>%
  map_df(~as_data_frame(as.list(setNames(., cols)))) %>%
  mutate(area_name=labs)

answered Sep 29 '22 05:09

hrbrmstr

Great answers above! For future readers: any time you face a complex XML document that needs importing into R, consider restructuring it with XSLT (a special-purpose declarative language that transforms XML content to suit various end uses). Then simply use the xmlToDataFrame() function from R's XML package.

Unfortunately, at the time of writing, R does not have a dedicated XSLT package on CRAN that works across all operating systems. The listed Sxslt package seems to be Linux-only and cannot be used on Windows; there are unanswered SO questions about this. I understand @hrbrmstr (above) maintains a GitHub XSLT project. Nonetheless, nearly all general-purpose languages have XSLT processors, including Java, C#, Python, PHP, Perl, and VB.

Below is the open-source Python route. Because the XML document is pretty nuanced, two XSLTs are used (XSLT gurus can of course combine them into one, but try as I might, I couldn't get that to work).

FIRST XSLT (using a recursive template)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>

<!-- Identity Transform -->
<xsl:template match="node()|@*">
    <xsl:copy>
        <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
</xsl:template>

<!-- Recursively split each record's text on single spaces into <data> elements -->
<xsl:template match="record/text()" name="tokenize">
    <xsl:param name="text" select="."/>
    <xsl:param name="separator" select="' '"/>
    <xsl:choose>
        <xsl:when test="not(contains($text, $separator))">
            <data>
                <xsl:value-of select="normalize-space($text)"/>
            </data>
        </xsl:when>
        <xsl:otherwise>
            <data>
                <xsl:value-of select="normalize-space(substring-before($text, $separator))"/>
            </data>
            <xsl:call-template name="tokenize">
                <xsl:with-param name="text" select="substring-after($text, $separator)"/>
            </xsl:call-template>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

<!-- Drop the metadata elements -->
<xsl:template match="description|variables|categoricalvariable|realvariable">
</xsl:template>

</xsl:stylesheet>
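To make the recursion in the tokenize template easier to follow, here is a rough Python sketch of the same logic. It is only an illustration and not part of the stylesheet: the function name tokenize mirrors the template name, normalize_space is a hypothetical stand-in for the XPath function of the same name, and the sample strings are made up.

```python
def normalize_space(text):
    """Rough stand-in for XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())


def tokenize(text, separator=" "):
    """Mimic the recursive template: one token per separator-delimited piece."""
    if separator not in text:
        # xsl:when branch: no separator left, so emit the remainder
        return [normalize_space(text)]
    # xsl:otherwise branch: emit the text before the first separator,
    # then recurse on everything after it
    before, _, after = text.partition(separator)
    return [normalize_space(before)] + tokenize(after, separator)


clean = tokenize("1 1 1075 75")
shifted = tokenize("672 na  60 29")  # note the two spaces after "na"
print(clean)
print(shifted)
```

Because the split happens on every single space, the doubled space in the second call yields an empty token between "na" and "60". In the stylesheet the same situation produces an empty data element, so every later value moves one position to the right.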

SECOND XSLT

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Copy the <records> wrapper and process its children -->
    <xsl:template match="records">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <!-- Map each positional <data> element to a named column -->
    <xsl:template match="record">
        <record>
            <area_name><xsl:value-of select="@label"/></area_name>
            <area><xsl:value-of select="data[1]"/></area>
            <region><xsl:value-of select="data[2]"/></region>
            <palmitic><xsl:value-of select="data[3]"/></palmitic>
            <palmitoleic><xsl:value-of select="data[4]"/></palmitoleic>
            <stearic><xsl:value-of select="data[5]"/></stearic>
            <oleic><xsl:value-of select="data[6]"/></oleic>
            <linoleic><xsl:value-of select="data[7]"/></linoleic>
            <linolenic><xsl:value-of select="data[8]"/></linolenic>
            <arachidic><xsl:value-of select="data[9]"/></arachidic>
            <eicosenoic><xsl:value-of select="data[10]"/></eicosenoic>
        </record>
    </xsl:template>

</xsl:stylesheet>

Python (using lxml module)

import os
import lxml.etree as ET

# directory of this script; the XSLT files are expected alongside it
cd = os.path.dirname(os.path.abspath(__file__))

# FIRST TRANSFORMATION
dom = ET.parse('http://www.ggobi.org/book/data/olive.xml')
xslt = ET.parse(os.path.join(cd, 'Olive.xsl'))
transform = ET.XSLT(xslt)
newdom = transform(dom)

tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True,
                       xml_declaration=True)

# save the intermediate result
with open(os.path.join(cd, 'Olive_py.xml'), 'wb') as xmlfile:
    xmlfile.write(tree_out)

# SECOND TRANSFORMATION
dom = ET.parse(os.path.join(cd, 'Olive_py.xml'))
xslt = ET.parse(os.path.join(cd, 'Olive2.xsl'))
transform = ET.XSLT(xslt)
newdom = transform(dom)

tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True,
                       xml_declaration=True)

# overwrite the intermediate file with the final result
with open(os.path.join(cd, 'Olive_py.xml'), 'wb') as xmlfile:
    xmlfile.write(tree_out)

R

library(XML)

# LOADING TRANSFORMED XML INTO R DATA FRAME
doc <- xmlParse("Olive_py.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//record"))
View(xmldf)

Output

area_name     area  region  palmitic  palmitoleic  stearic  oleic  linoleic  linolenic  arachidic  eicosenoic
North-Apulia  1     1       1075      75           226      7823   672       na                    60
North-Apulia  1     1       1088      73           224      7709   781       31         61         29
North-Apulia  1     1       911       54           246      8113   549       31         63         29
North-Apulia  1     1       966       57           240      7952   619       50         78         35
North-Apulia  1     1       1051      67           259      7771   672       50         80         46
...

(Slight cleanup of the very first record is needed: the XML document has an extra space after "na", so the arachidic and eicosenoic values were shifted forward by one column.)
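One way to avoid that shift, offered as a sketch rather than as this answer's method, is to split each record on runs of whitespace before the columns are assigned. The example below swaps XSLT for Python's standard library (xml.etree.ElementTree and csv). It runs on a small made-up sample that imitates the record layout (record elements with a label attribute and space-separated values), and the records_to_rows helper is hypothetical. The column order is copied from the second XSLT.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample imitating the record layout (not the real olive.xml)
SAMPLE = """<ggobidata><data><records>
<record label="North-Apulia">1 1 1075 75 226 7823 672 na  60 29</record>
<record label="North-Apulia">1 1 1088 73 224 7709 781 31 61 29</record>
</records></data></ggobidata>"""

# Column order copied from the second XSLT
COLUMNS = ["area", "region", "palmitic", "palmitoleic", "stearic", "oleic",
           "linoleic", "linolenic", "arachidic", "eicosenoic"]


def records_to_rows(xml_text):
    """Return one dict per <record>; str.split() collapses repeated spaces."""
    root = ET.fromstring(xml_text)
    rows = []
    for rec in root.iter("record"):
        values = (rec.text or "").split()
        row = {"area_name": rec.get("label")}
        row.update(zip(COLUMNS, values))
        rows.append(row)
    return rows


rows = records_to_rows(SAMPLE)

# Serialize to CSV text; the literal "na" is kept verbatim
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["area_name"] + COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

If the CSV text were saved to a file, R could load it with read.csv(), and passing na.strings = "na" would turn the literal "na" into a proper NA.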

answered Sep 29 '22 04:09

Parfait
