I have the following XML document:
<node0>
<node1>
<node2 a1="x1"> ... </node2>
<node2 a1="x2"> ... </node2>
<node2 a1="x1"> ... </node2>
</node1>
</node0>
I want to filter out node2 when a1="x2". The user provides the XPath and the attribute values that need to be tested and filtered out. I looked at some solutions in Python, like BeautifulSoup, but they are too complicated and don't preserve the case of text. I want to keep the document the same as before, with only the matching elements filtered out.
Can you recommend a simple and succinct solution? This should not be too complicated from the looks of it. The actual XML document is not as simple as the one above, but the idea is the same.
Python allows parsing these XML documents using two modules, namely the xml.etree.ElementTree module and minidom (Minimal DOM Implementation). Parsing means reading information from a file and splitting it into pieces by identifying the parts of that particular XML file.
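For comparison, here is a minimal sketch of the same filtering done with minidom. It assumes the sample document from the question; the variable names are only illustrative, and it is one possible approach rather than a recommended one.

```python
from xml.dom import minidom

data = '''\
<node0>
<node1>
<node2 a1="x1"> ... </node2>
<node2 a1="x2"> ... </node2>
<node2 a1="x1"> ... </node2>
</node1>
</node0>
'''

dom = minidom.parseString(data)
# Copy the node list first so removals cannot disturb the loop
for node in list(dom.getElementsByTagName('node2')):
    if node.getAttribute('a1') == 'x2':
        # DOM nodes know their parent, so no extra bookkeeping is needed
        node.parentNode.removeChild(node)

# Serialize the root element without the XML declaration
out = dom.documentElement.toxml()
print(out)
```

Unlike ElementTree, minidom keeps whitespace as separate text nodes, so the blank line left behind by the removed element stays in the output.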
The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data. Changed in version 3.3: This module will use a fast implementation whenever available.
set(key, value): sets the attribute key on the element to value. append(subelement): adds subelement as the last child of the element, whether that element is the root or a nested element.
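As a small illustration of set and append, the sketch below builds one node2 element by hand. The element and attribute names are borrowed from the question; nothing else is assumed.

```python
import xml.etree.ElementTree as ET

root = ET.Element('node1')
child = ET.Element('node2')
child.set('a1', 'x1')   # set the attribute a1 to "x1"
child.text = ' ... '
root.append(child)      # child becomes the last child of root

xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
# <node1><node2 a1="x1"> ... </node2></node1>
```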
This uses xml.etree.ElementTree, which is in the standard library:
import xml.etree.ElementTree as xee
data='''\
<node1>
<node2 a1="x1"> ... </node2>
<node2 a1="x2"> ... </node2>
<node2 a1="x1"> ... </node2>
</node1>
'''
doc = xee.fromstring(data)
# findall returns a list, so removing while looping is safe
for tag in doc.findall('node2'):
    if tag.get('a1') == 'x2':
        doc.remove(tag)
print(xee.tostring(doc, encoding='unicode'))
# <node1>
# <node2 a1="x1"> ... </node2>
# <node2 a1="x1"> ... </node2>
# </node1>
This uses lxml, which is not in the standard library, but has a more powerful syntax:
import lxml.etree
data='''\
<node1>
<node2 a1="x1"> ... </node2>
<node2 a1="x2"> ... </node2>
<node2 a1="x1"> ... </node2>
</node1>
'''
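doc = lxml.etree.XML(data)
# findall returns every match, so nothing breaks if there are zero or several
for e in doc.findall('node2[@a1="x2"]'):
    doc.remove(e)
print(lxml.etree.tostring(doc, encoding='unicode'))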
# <node1>
# <node2 a1="x1"> ... </node2>
# <node2 a1="x1"> ... </node2>
# </node1>
Edit: If node2 is buried more deeply in the XML, then you can iterate through all the tags, check each parent tag to see if the node2 element is one of its children, and then remove it if so:
Using only xml.etree.ElementTree:
doc = xee.fromstring(data)
# iter() replaces getiterator(), which was removed in Python 3.9
for parent in list(doc.iter()):
    for child in parent.findall('node2'):
        if child.get('a1') == 'x2':
            parent.remove(child)
Using lxml:
doc = lxml.etree.XML(data)
# Copy the iterator into a list before modifying the tree
for parent in list(doc.iter('*')):
    for child in parent.findall('node2[@a1="x2"]'):
        parent.remove(child)
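Since the question says the path and the attribute values come from the user, the loops above could be wrapped in a small helper. This is only a sketch using the standard library; the function name remove_matching and its parameters are made up for illustration, and the path must be one that ElementTree's limited XPath subset understands.

```python
import xml.etree.ElementTree as ET

def remove_matching(root, path, attr, values):
    """Remove every element matched by path whose attr value is in values.

    Returns the number of removed elements.
    """
    # ElementTree elements do not know their parent, so build a child -> parent map
    parent_of = {child: parent for parent in root.iter() for child in parent}
    removed = 0
    for elem in root.findall(path):
        if elem.get(attr) in values:
            parent = parent_of.get(elem)
            if parent is not None:  # the root itself has no parent and is skipped
                parent.remove(elem)
                removed += 1
    return removed

data = '''\
<node0>
<node1>
<node2 a1="x1"> ... </node2>
<node2 a1="x2"> ... </node2>
<node2 a1="x1"> ... </node2>
</node1>
</node0>
'''

root = ET.fromstring(data)
count = remove_matching(root, './/node2', 'a1', {'x2'})
result = ET.tostring(root, encoding='unicode')
print(count)
print(result)
```

With lxml the parent map would not be needed, because each element there has a getparent() method.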