I want to parse a fairly large XML-like file that has no root element. The format of the file is:
<tag1>
<tag2>
</tag2>
</tag1>
<tag1>
<tag3/>
</tag1>
What I tried:
The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data. Changed in version 3.3: This module will use a fast implementation whenever available.
lxml.html can parse fragments:
from lxml import html
s = """<tag1>
<tag2>
</tag2>
</tag1>
<tag1>
<tag3/>
</tag1>"""
# fromstring() wraps multiple top-level elements in a single parent element
doc = html.fromstring(s)
for thing in doc:
    print(thing)
    for other in thing:
        print(other)
"""
>>>
<Element tag1 at 0x3411a80>
<Element tag2 at 0x3428990>
<Element tag1 at 0x3428930>
<Element tag3 at 0x3411a80>
>>>
"""
Courtesy of this SO answer.
And if there is more than one level of nesting:
def flatten(nested):
    """Recursively flatten nested elements.
    Yields individual elements.
    """
    for thing in nested:
        yield thing
        for other in flatten(thing):
            yield other
doc = html.fromstring(s)
for thing in flatten(doc):
    print(thing)
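As a cross-check on the recursive generator above, the element API also provides iter(), which walks an element and all of its descendants in document order. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml, and the sample string (including the extra tag4 level and the root wrapper) is made up for illustration:

```python
import xml.etree.ElementTree as ET


def flatten(nested):
    """Recursively yield every descendant of `nested` in document order."""
    for thing in nested:
        yield thing
        for other in flatten(thing):
            yield other


# Hypothetical sample with two levels of nesting under a wrapper element.
root = ET.fromstring(
    "<root><tag1><tag2><tag4/></tag2></tag1><tag1><tag3/></tag1></root>"
)

flat_tags = [el.tag for el in flatten(root)]
# iter() also yields the starting element itself, so skip the first item.
iter_tags = [el.tag for el in root.iter()][1:]

print(flat_tags)
print(iter_tags)
```

The only difference between the two is that iter() includes the element it is called on, while flatten() starts at its children.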
Similarly, lxml.etree.HTML will parse this. It adds html and body tags:
from lxml import etree
d = etree.HTML(s)
for thing in d.iter():
    print(thing)
"""
<Element html at 0x3233198>
<Element body at 0x322fcb0>
<Element tag1 at 0x3233260>
<Element tag2 at 0x32332b0>
<Element tag1 at 0x322fcb0>
<Element tag3 at 0x3233148>
"""
How about, instead of editing the file, doing something like this:
import xml.etree.ElementTree as ET
# Wrap the file's contents in a synthetic <root> element so it parses as one document
with open("xml-file.xml") as f:
    xml_object = ET.fromstringlist(["<root>", f.read(), "</root>"])
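Because the file is described as fairly huge, reading it whole with f.read() may use a lot of memory. A possible streaming variant of the same wrap-in-a-root idea is sketched below with the standard library's XMLPullParser (Python 3.4+). The helper name iter_top_level and the chunk size are illustrative assumptions, and the sketch assumes a text-mode stream with no XML declaration:

```python
import io
import xml.etree.ElementTree as ET


def iter_top_level(stream, chunk_size=65536):
    """Yield each top-level element of a rootless XML text stream.

    The stream is fed to the parser between a synthetic <root> and </root>,
    so the parser sees one well-formed document.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    root = None

    def chunks():
        yield "<root>"
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk
        yield "</root>"

    for chunk in chunks():
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                if depth == 1:
                    root = elem  # the synthetic wrapper
            else:
                depth -= 1
                if depth == 1:
                    # A direct child of the wrapper has just been closed.
                    yield elem
                    # Drop finished children so memory use stays bounded.
                    root.clear()


# Small in-memory stand-in for the real file.
sample = io.StringIO("<tag1>\n<tag2>\n</tag2>\n</tag1>\n<tag1>\n<tag3/>\n</tag1>")
for element in iter_top_level(sample, chunk_size=16):
    print(element.tag, [child.tag for child in element])
```

With a real file, something like `with open("xml-file.xml") as f: for element in iter_top_level(f): ...` should work the same way. Each yielded element should be processed before asking for the next one, since the wrapper is cleared in between.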