For a homework assignment I am attempting to convert an XML file into a data frame in R. I have tried many different things, and I have searched for ideas on the internet but have been unsuccessful. Here is my code so far:
library(XML) url <- 'http://www.ggobi.org/book/data/olive.xml' doc <- xmlParse(myUrl) root <- xmlRoot(doc) dataFrame <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue)) data.frame(t(dataFrame),row.names=NULL)
The output I get is like a giant vector of numbers. I am attempting to organize the data into a data frame, but I do not know how to properly adjust my code to obtain that.
Reading with lxmlparse() and give it the path to XML file. To get the root element, we will use getroot() on the parsed XML data. Now we can loop through the children elements of the root node and write them into a Python list. Like before, we'll create a DataFrame using the data list, and transpose the DataFrame.
An XML file can be read in R using the function xmlParse() . Then, load data is stored in a list. An XML file can also be read in the form of a data frame by using the xmlToDataFrame() method.
as. data. frame() function in R Programming Language is used to convert an object to data frame. These objects can be Vectors, Lists, Matrices, and Factors.
It may not be as verbose as the XML
package but xml2
doesn't have the memory leaks and is laser-focused on data extraction. I use trimws
which is a really recent addition to R core.
library(xml2) pg <- read_xml("http://www.ggobi.org/book/data/olive.xml") # get all the <record>s recs <- xml_find_all(pg, "//record") # extract and clean all the columns vals <- trimws(xml_text(recs)) # extract and clean (if needed) the area names labs <- trimws(xml_attr(recs, "label")) # mine the column names from the two variable descriptions # this XPath construct lets us grab either the <categ…> or <real…> tags # and then grabs the 'name' attribute of them cols <- xml_attr(xml_find_all(pg, "//data/variables/*[self::categoricalvariable or self::realvariable]"), "name") # this converts each set of <record> columns to a data frame # after first converting each row to numeric and assigning # names to each column (making it easier to do the matrix to data frame conv) dat <- do.call(rbind, lapply(strsplit(vals, "\ +"), function(x) { data.frame(rbind(setNames(as.numeric(x),cols))) })) # then assign the area name column to the data frame dat$area_name <- labs head(dat) ## region area palmitic palmitoleic stearic oleic linoleic linolenic ## 1 1 1 1075 75 226 7823 672 NA ## 2 1 1 1088 73 224 7709 781 31 ## 3 1 1 911 54 246 8113 549 31 ## 4 1 1 966 57 240 7952 619 50 ## 5 1 1 1051 67 259 7771 672 50 ## 6 1 1 911 49 268 7924 678 51 ## arachidic eicosenoic area_name ## 1 60 29 North-Apulia ## 2 61 29 North-Apulia ## 3 63 29 North-Apulia ## 4 78 35 North-Apulia ## 5 80 46 North-Apulia ## 6 70 44 North-Apulia
UPDATE
I'd prbly do the last bit this way now:
library(tidyverse) strsplit(vals, "[[:space:]]+") %>% map_df(~as_data_frame(as.list(setNames(., cols)))) %>% mutate(area_name=labs)
Great answers above! For future readers, anytime you face a complex XML needing R import, consider re-structuring the XML document using XSLT (a special-purpose declarative programming language that manipulates XML content into various end-use needs). Then simply use R's xmlToDataFrame()
function from XML package.
Unfortunately, R does not have a dedicated XSLT package available on CRAN-R across all operating systems. The listed SXLT seems to be a Linux package and not able to be used on Windows. See unanswered SO questions here and here. I understand @hrbrmstr (above) maintains a GitHub XSLT project. Nonetheless, nearly all general-purpose languages maintain XSLT processors including Java, C#, Python, PHP, Perl, and VB.
Below is the open-source Python route and because the XML document is pretty nuanced, two XSLTs are being used (of course XSLT gurus can combine them into one but tried as I might couldn't get it to work.
FIRST XSLT (using a recursive template)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output omit-xml-declaration="yes" indent="yes"/> <xsl:strip-space elements="*"/> <!-- Identity Transform --> <xsl:template match="node()|@*"> <xsl:copy> <xsl:apply-templates select="node()|@*"/> </xsl:copy> </xsl:template> <xsl:template match="record/text()" name="tokenize"> <xsl:param name="text" select="."/> <xsl:param name="separator" select="' '"/> <xsl:choose> <xsl:when test="not(contains($text, $separator))"> <data> <xsl:value-of select="normalize-space($text)"/> </data> </xsl:when> <xsl:otherwise> <data> <xsl:value-of select="normalize-space(substring-before($text, $separator))"/> </data> <xsl:call-template name="tokenize"> <xsl:with-param name="text" select="substring-after($text, $separator)"/> </xsl:call-template> </xsl:otherwise> </xsl:choose> </xsl:template> <xsl:template match="description|variables|categoricalvariable|realvariable"> </xsl:template>
SECOND XSLT
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <!-- Identity Transform --> <xsl:template match="records"> <xsl:copy> <xsl:apply-templates select="node()|@*"/> </xsl:copy> </xsl:template> <xsl:template match="record"> <record> <area_name><xsl:value-of select="@label"/></area_name> <area><xsl:value-of select="data[1]"/></area> <region><xsl:value-of select="data[2]"/></region> <palmitic><xsl:value-of select="data[3]"/></palmitic> <palmitoleic><xsl:value-of select="data[4]"/></palmitoleic> <stearic><xsl:value-of select="data[5]"/></stearic> <oleic><xsl:value-of select="data[6]"/></oleic> <linoleic><xsl:value-of select="data[7]"/></linoleic> <linolenic><xsl:value-of select="data[8]"/></linolenic> <arachidic><xsl:value-of select="data[9]"/></arachidic> <eicosenoic><xsl:value-of select="data[10]"/></eicosenoic> </record> </xsl:template> </xsl:stylesheet>
Python (using lxml module)
import lxml.etree as ET cd = os.path.dirname(os.path.abspath(__file__)) # FIRST TRANSFORMATION dom = ET.parse('http://www.ggobi.org/book/data/olive.xml') xslt = ET.parse(os.path.join(cd, 'Olive.xsl')) transform = ET.XSLT(xslt) newdom = transform(dom) tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True) xmlfile = open(os.path.join(cd, 'Olive_py.xml'),'wb') xmlfile.write(tree_out) xmlfile.close() # SECOND TRANSFORMATION dom = ET.parse(os.path.join(cd, 'Olive_py.xml')) xslt = ET.parse(os.path.join(cd, 'Olive2.xsl')) transform = ET.XSLT(xslt) newdom = transform(dom) tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True) xmlfile = open(os.path.join(cd, 'Olive_py.xml'),'wb') xmlfile.write(tree_out) xmlfile.close()
R
library(XML) # LOADING TRANSFORMED XML INTO R DATA FRAME doc<-xmlParse("Olive_py.xml") xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//record")) View(xmldf)
Output
area_name area region palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic North-Apulia 1 1 1075 75 226 7823 672 na 60 North-Apulia 1 1 1088 73 224 7709 781 31 61 29 North-Apulia 1 1 911 54 246 8113 549 31 63 29 North-Apulia 1 1 966 57 240 7952 619 50 78 35 North-Apulia 1 1 1051 67 259 7771 672 50 80 46 ...
(slight cleanup on very first record is needed as an extra space was added after "na" in xml doc, so arachidic
and eicosenoic
were shifted forward)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With