
Parse an XML file with multiple root elements in Python

I have an XML file and need to fetch some of its tags. The data looks like this:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein1">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria1" direction="E"/>
        <neighbor name="Switzerland1" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia1" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

I need to parse this, so I used:

import xml.etree.ElementTree as ET
tree = ET.parse("myfile.xml")
root = tree.getroot()

This code gives an error at line 2: xml.etree.ElementTree.ParseError: junk after document element:

I think this is because of the multiple xml tags. Do you have any idea how I should parse this?

ggupta asked Aug 03 '17 05:08



3 Answers

There's a simple trick I've used to parse such pseudo-XML (Wazuh rule files, for what it's worth): temporarily wrap it inside a fake element, <whatever></whatever>, so that a single root sits over all these "roots".

In your case, rather than having invalid XML like this:

<data> ... </data>
<data> ... </data>

just before passing it to the parser, temporarily rewrite it as:

<whatever>
    <data> ... </data>
    <data> ... </data>
</whatever>

Then parse it as usual and iterate over the <data> elements.

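import xml.etree.ElementTree as etree
from pathlib import Path

file = Path('rules/0020-syslog_rules.xml')
# Wrap the raw bytes in a single fake root element
data = b'<rules>' + file.read_bytes() + b'</rules>'
# fromstring() returns the root Element; search from there
root = etree.fromstring(data)
root.findall('group')
... list of Elements ...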
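
One caveat for the file in the question: it repeats <?xml version="1.0"?> before every <data> block, and an XML declaration is only allowed at the very start of a document, so wrapping alone would still fail there. Below is a minimal sketch of the same wrapping trick with the declarations stripped first. The function name parse_concatenated, the regular expression, and the inline sample are assumptions made for illustration, not part of the original answer.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration such as <?xml version="1.0"?>
XML_DECL = re.compile(r'<\?xml\s[^?]*\?>')


def parse_concatenated(text, wrapper='whatever'):
    """Parse text holding several XML documents under one fake root.

    `wrapper` is an arbitrary (hypothetical) element name; any valid tag works.
    """
    # Declarations are only legal at the start of a document, so drop them all
    body = XML_DECL.sub('', text)
    return ET.fromstring('<{0}>{1}</{0}>'.format(wrapper, body))


# Shortened stand-in for the file shown in the question
sample = '''<?xml version="1.0"?>
<data><country name="Liechtenstein"><rank>1</rank></country></data>
<?xml version="1.0"?>
<data><country name="Liechtenstein1"><rank>1</rank></country></data>
'''

root = parse_concatenated(sample)
names = [country.get('name') for country in root.iter('country')]
print(names)
```

For a real file, something like parse_concatenated(open('myfile.xml').read()) should work, assuming the file is small enough to hold in memory and the declarations carry no encoding that differs from how the file is opened.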
kravietz answered Oct 18 '22 22:10


This code fills in details for one approach, if you want them.

The code accumulates lines in accumulated_xml until it encounters the beginning of another XML document or the end of the file. Once it has a complete XML document, it calls display, which uses the lxml library to parse the document and report some of its contents.

>>> from lxml import etree
>>> def display(alist):
...     tree = etree.fromstring(''.join(alist))
...     for country in tree.xpath('.//country'):
...         print(country.attrib['name'], country.find('rank').text, country.find('year').text)
...         print([neighbour.attrib['name'] for neighbour in country.xpath('neighbor')])
... 
>>> accumulated_xml = []
>>> with open('temp.xml') as temp:
...     while True:
...         line = temp.readline()
...         if line:
...             if line.startswith('<?xml'):
...                 if accumulated_xml:
...                     display (accumulated_xml)
...                     accumulated_xml = []
...             else:
...                 accumulated_xml.append(line.strip())
...         else:
...             display (accumulated_xml)
...             break
... 
Liechtenstein 1 2008
['Austria', 'Switzerland']
Singapore 4 2011
['Malaysia']
Panama 68 2011
['Costa Rica', 'Colombia']
Liechtenstein1 1 2008
['Austria1', 'Switzerland1']
Singapore 4 2011
['Malaysia1']
Panama 68 2011
['Costa Rica', 'Colombia']
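
The same accumulate-and-flush loop can also be packaged as a generator, which keeps the splitting separate from whatever is done with each document. This is only a sketch: it swaps lxml for the standard library's xml.etree.ElementTree, the name iter_documents is made up for illustration, and it assumes each <?xml ...?> declaration sits on a line of its own, as in the question's file.

```python
import io
import xml.etree.ElementTree as ET


def iter_documents(lines):
    """Yield one parsed root element per XML document found in `lines`.

    Assumption: every document starts with a line beginning '<?xml',
    and that line holds nothing but the declaration.
    """
    chunk = []
    for line in lines:
        if line.startswith('<?xml'):
            # A new declaration closes off the previous document, if any
            if chunk:
                yield ET.fromstring(''.join(chunk))
                chunk = []
        else:
            chunk.append(line)
    # Flush the last document once the input is exhausted
    if chunk:
        yield ET.fromstring(''.join(chunk))


# In-memory stand-in for open('temp.xml')
sample = io.StringIO(
    '<?xml version="1.0"?>\n'
    '<data><country name="Panama"><rank>68</rank></country></data>\n'
    '<?xml version="1.0"?>\n'
    '<data><country name="Singapore"><rank>4</rank></country></data>\n'
)

ranks = {
    country.get('name'): int(country.findtext('rank'))
    for doc in iter_documents(sample)
    for country in doc.iter('country')
}
print(ranks)
```

Any iterable of lines works as input, so an open file object can be passed in directly.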
Bill Bell answered Oct 19 '22 00:10


Question: ... any idea, how should i parse this?

Read through the whole file and split it into valid chunks, each starting at a <?xml ... declaration.
This creates myfile_01, myfile_02 ... myfile_nn.

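n = 0
out_fh = None
with open('myfile.xml') as in_fh:
    while True:
        line = in_fh.readline()
        if not line: break

        # Each XML declaration starts a new output file
        if line.startswith('<?xml'):
            if out_fh:
                out_fh.close()
            n += 1
            # Open for writing; the default mode is read-only
            out_fh = open('myfile_{:02}'.format(n), 'w')

        # Skip anything that precedes the first declaration
        if out_fh:
            out_fh.write(line)

    # out_fh is still None if no declaration was found
    if out_fh:
        out_fh.close()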

If you want all <country> elements in one XML tree:

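import re
from xml.etree import ElementTree as ET

with open('myfile.xml') as fh:
    # Collect every <country>...</country> block across all documents
    countries = re.findall('<country.*?</country>', fh.read(), re.S)

# Re-wrap the blocks in a single <data> root and parse that
root = ET.fromstring(
    '<?xml version="1.0"?><data>{}</data>'.format(''.join(countries)))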

Tested with Python: 3.4.2

stovfl answered Oct 18 '22 22:10