I need to completely remove elements, based on the contents of an attribute, using python's lxml. Example:
import lxml.etree as et
xml="""
<groceries>
<fruit state="rotten">apple</fruit>
<fruit state="fresh">pear</fruit>
<fruit state="fresh">starfruit</fruit>
<fruit state="rotten">mango</fruit>
<fruit state="fresh">peach</fruit>
</groceries>
"""
tree=et.fromstring(xml)
for bad in tree.xpath("//fruit[@state=\'rotten\']"):
#remove this element from the tree
print et.tostring(tree, pretty_print=True)
I would like this to print:
<groceries>
<fruit state="fresh">pear</fruit>
<fruit state="fresh">starfruit</fruit>
<fruit state="fresh">peach</fruit>
</groceries>
Is there a way to do this without storing a temporary variable and printing to it manually, as:
newxml="<groceries>\n"
for elt in tree.xpath('//fruit[@state=\'fresh\']'):
newxml+=et.tostring(elt)
newxml+="</groceries>"
Use the remove
method of an xmlElement :
tree=et.fromstring(xml)
for bad in tree.xpath("//fruit[@state=\'rotten\']"):
bad.getparent().remove(bad) # here I grab the parent of the element to call the remove directly on it
print et.tostring(tree, pretty_print=True, xml_declaration=True)
If I had to compare with the @Acorn version, mine will work even if the elements to remove are not directly under the root node of your xml.
You're looking for the remove
function. Call the tree's remove method and pass it a subelement to remove.
import lxml.etree as et
xml="""
<groceries>
<fruit state="rotten">apple</fruit>
<fruit state="fresh">pear</fruit>
<punnet>
<fruit state="rotten">strawberry</fruit>
<fruit state="fresh">blueberry</fruit>
</punnet>
<fruit state="fresh">starfruit</fruit>
<fruit state="rotten">mango</fruit>
<fruit state="fresh">peach</fruit>
</groceries>
"""
tree=et.fromstring(xml)
for bad in tree.xpath("//fruit[@state='rotten']"):
bad.getparent().remove(bad)
print et.tostring(tree, pretty_print=True)
Result:
<groceries>
<fruit state="fresh">pear</fruit>
<fruit state="fresh">starfruit</fruit>
<fruit state="fresh">peach</fruit>
</groceries>
I met one situation:
<div>
<script>
some code
</script>
text here
</div>
div.remove(script)
will remove the text here
part which I didn't mean to.
following the answer here, I found that etree.strip_elements
is a better solution for me, which you can control whether or not you will remove the text behind with with_tail=(bool)
param.
But still I don't know if this can use xpath filter for tag. Just put this for informing.
Here is the doc:
strip_elements(tree_or_element, *tag_names, with_tail=True)
Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the
with_tail
keyword argument option to False.Tag names can contain wildcards as in
_Element.iter
.Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.
Example usage::
strip_elements(some_element, 'simpletagname', # non-namespaced tag '{http://some/ns}tagname', # namespaced tag '{http://some/other/ns}*' # any tag from a namespace lxml.etree.Comment # comments )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With