Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: convert XML data to data frame

Tags:

dataframe

r

xml

For a homework assignment I am attempting to convert an XML file into a data frame in R. I have tried many different things, and I have searched for ideas on the internet but have been unsuccessful. Here is my code so far:

library(XML) url <- 'http://www.ggobi.org/book/data/olive.xml' doc <- xmlParse(myUrl) root <- xmlRoot(doc)  dataFrame <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue)) data.frame(t(dataFrame),row.names=NULL) 

The output I get is like a giant vector of numbers. I am attempting to organize the data into a data frame, but I do not know how to properly adjust my code to obtain that.

like image 755
mapleleaf Avatar asked Oct 31 '15 00:10

mapleleaf


People also ask

How do I read an XML file into a DataFrame?

Reading with lxmlparse() and give it the path to XML file. To get the root element, we will use getroot() on the parsed XML data. Now we can loop through the children elements of the root node and write them into a Python list. Like before, we'll create a DataFrame using the data list, and transpose the DataFrame.

How do I import XML data into R?

An XML file can be read in R using the function xmlParse() . Then, load data is stored in a list. An XML file can also be read in the form of a data frame by using the xmlToDataFrame() method.

How do I convert a data frame in R?

as. data. frame() function in R Programming Language is used to convert an object to data frame. These objects can be Vectors, Lists, Matrices, and Factors.


2 Answers

It may not be as verbose as the XML package but xml2 doesn't have the memory leaks and is laser-focused on data extraction. I use trimws which is a really recent addition to R core.

library(xml2)  pg <- read_xml("http://www.ggobi.org/book/data/olive.xml")  # get all the <record>s recs <- xml_find_all(pg, "//record")  # extract and clean all the columns vals <- trimws(xml_text(recs))  # extract and clean (if needed) the area names labs <- trimws(xml_attr(recs, "label"))  # mine the column names from the two variable descriptions # this XPath construct lets us grab either the <categ…> or <real…> tags # and then grabs the 'name' attribute of them cols <- xml_attr(xml_find_all(pg, "//data/variables/*[self::categoricalvariable or                                                       self::realvariable]"), "name")  # this converts each set of <record> columns to a data frame # after first converting each row to numeric and assigning # names to each column (making it easier to do the matrix to data frame conv) dat <- do.call(rbind, lapply(strsplit(vals, "\ +"),                                  function(x) {                                    data.frame(rbind(setNames(as.numeric(x),cols)))                                  }))  # then assign the area name column to the data frame dat$area_name <- labs  head(dat) ##   region area palmitic palmitoleic stearic oleic linoleic linolenic ## 1      1    1     1075          75     226  7823      672        NA ## 2      1    1     1088          73     224  7709      781        31 ## 3      1    1      911          54     246  8113      549        31 ## 4      1    1      966          57     240  7952      619        50 ## 5      1    1     1051          67     259  7771      672        50 ## 6      1    1      911          49     268  7924      678        51 ##   arachidic eicosenoic    area_name ## 1        60         29 North-Apulia ## 2        61         29 North-Apulia ## 3        63         29 North-Apulia ## 4        78         35 North-Apulia ## 5        80         46 North-Apulia ## 6        70         44 North-Apulia 

UPDATE

I'd prbly do the last bit this way now:

library(tidyverse)  strsplit(vals, "[[:space:]]+") %>%    map_df(~as_data_frame(as.list(setNames(., cols)))) %>%    mutate(area_name=labs) 
like image 97
hrbrmstr Avatar answered Sep 29 '22 05:09

hrbrmstr


Great answers above! For future readers, anytime you face a complex XML needing R import, consider re-structuring the XML document using XSLT (a special-purpose declarative programming language that manipulates XML content into various end-use needs). Then simply use R's xmlToDataFrame() function from XML package.

Unfortunately, R does not have a dedicated XSLT package available on CRAN-R across all operating systems. The listed SXLT seems to be a Linux package and not able to be used on Windows. See unanswered SO questions here and here. I understand @hrbrmstr (above) maintains a GitHub XSLT project. Nonetheless, nearly all general-purpose languages maintain XSLT processors including Java, C#, Python, PHP, Perl, and VB.

Below is the open-source Python route and because the XML document is pretty nuanced, two XSLTs are being used (of course XSLT gurus can combine them into one but tried as I might couldn't get it to work.

FIRST XSLT (using a recursive template)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output omit-xml-declaration="yes" indent="yes"/> <xsl:strip-space elements="*"/>  <!-- Identity Transform -->     <xsl:template match="node()|@*">     <xsl:copy>        <xsl:apply-templates select="node()|@*"/>     </xsl:copy> </xsl:template>  <xsl:template match="record/text()" name="tokenize">             <xsl:param name="text" select="."/>     <xsl:param name="separator" select="' '"/>     <xsl:choose>                     <xsl:when test="not(contains($text, $separator))">                             <data>                 <xsl:value-of select="normalize-space($text)"/>             </data>                       </xsl:when>         <xsl:otherwise>             <data>                                   <xsl:value-of select="normalize-space(substring-before($text, $separator))"/>                               </data>                               <xsl:call-template name="tokenize">                 <xsl:with-param name="text" select="substring-after($text, $separator)"/>             </xsl:call-template>                         </xsl:otherwise>                 </xsl:choose>         </xsl:template>       <xsl:template match="description|variables|categoricalvariable|realvariable">         </xsl:template>  

SECOND XSLT

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">      <!-- Identity Transform -->         <xsl:template match="records">         <xsl:copy>            <xsl:apply-templates select="node()|@*"/>         </xsl:copy>     </xsl:template>      <xsl:template match="record">         <record>             <area_name><xsl:value-of select="@label"/></area_name>             <area><xsl:value-of select="data[1]"/></area>             <region><xsl:value-of select="data[2]"/></region>             <palmitic><xsl:value-of select="data[3]"/></palmitic>             <palmitoleic><xsl:value-of select="data[4]"/></palmitoleic>             <stearic><xsl:value-of select="data[5]"/></stearic>             <oleic><xsl:value-of select="data[6]"/></oleic>             <linoleic><xsl:value-of select="data[7]"/></linoleic>             <linolenic><xsl:value-of select="data[8]"/></linolenic>             <arachidic><xsl:value-of select="data[9]"/></arachidic>             <eicosenoic><xsl:value-of select="data[10]"/></eicosenoic>         </record>    </xsl:template>           </xsl:stylesheet> 

Python (using lxml module)

import lxml.etree as ET  cd = os.path.dirname(os.path.abspath(__file__))  # FIRST TRANSFORMATION dom = ET.parse('http://www.ggobi.org/book/data/olive.xml') xslt = ET.parse(os.path.join(cd, 'Olive.xsl')) transform = ET.XSLT(xslt) newdom = transform(dom)  tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True,  xml_declaration=True)  xmlfile = open(os.path.join(cd, 'Olive_py.xml'),'wb') xmlfile.write(tree_out) xmlfile.close()      # SECOND TRANSFORMATION dom = ET.parse(os.path.join(cd, 'Olive_py.xml')) xslt = ET.parse(os.path.join(cd, 'Olive2.xsl')) transform = ET.XSLT(xslt) newdom = transform(dom)  tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True,  xml_declaration=True)      xmlfile = open(os.path.join(cd, 'Olive_py.xml'),'wb') xmlfile.write(tree_out) xmlfile.close() 

R

library(XML)  # LOADING TRANSFORMED XML INTO R DATA FRAME doc<-xmlParse("Olive_py.xml") xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//record")) View(xmldf) 

Output

area_name   area    region  palmitic    palmitoleic stearic oleic   linoleic    linolenic   arachidic   eicosenoic North-Apulia 1      1       1075        75          226     7823        672          na                     60 North-Apulia 1      1       1088        73          224     7709        781          31          61         29 North-Apulia 1      1       911         54          246     8113        549          31          63         29 North-Apulia 1      1       966         57          240     7952        619          50          78         35 North-Apulia 1      1       1051        67          259     7771        672          50          80         46    ... 

(slight cleanup on very first record is needed as an extra space was added after "na" in xml doc, so arachidic and eicosenoic were shifted forward)

like image 29
Parfait Avatar answered Sep 29 '22 04:09

Parfait