Consider this Python script:
from lxml import etree
html = '''
<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body>
<p>This is some text followed with 2 citations.<span class="footnote">1</span>
<span class="footnote">2</span>This is some more text.</p>
</body>
</html>'''
tree = etree.fromstring(html)
for element in tree.findall(".//{*}span"):
    if element.get("class") == 'footnote':
        print(etree.tostring(element, encoding="unicode", pretty_print=True))
The desired output would be just the two span elements; instead I get:
<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">1</span>
<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">2</span>This is some more text.
Why does it include the text after the element until the end of the parent element?
I'm trying to use lxml to link footnotes. When I a.insert() the span element into the a element I create for it, the text after the span comes along with it, so large amounts of text that I don't want linked end up inside the link.
Specifying with_tail=False leaves the tail text out of the serialized output.
print(etree.tostring(element, encoding="unicode", pretty_print=True, with_tail=False))
See the lxml.etree.tostring documentation.
It includes the text after the element because, in lxml's data model, that text belongs to the element: it is stored as the element's tail.
If you don't want that text to belong to the preceding span, it needs to be contained in its own element. However, you can avoid printing this text when converting the element back to XML by passing with_tail=False to etree.tostring().
You can also simply set the element's tail to '' if you want to remove it from a specific element.
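For the footnote-linking case in the question, clearing the tail outright would drop the following text from the document. One possible approach, sketched below, is to hand the tail over to the new a element before moving the span into it. The helper name wrap_in_link, the "#fn..." href scheme, and the shortened sample text are all assumptions for illustration, not part of the original code.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

p = etree.fromstring(
    f'<p xmlns="{XHTML_NS}">Some text.<span class="footnote">1</span>\n'
    '<span class="footnote">2</span>More text.</p>'
)

def wrap_in_link(span, href):
    """Wrap span in a new <a>, keeping the span's tail text outside the link."""
    parent = span.getparent()
    a = etree.Element(f"{{{XHTML_NS}}}a", nsmap={None: XHTML_NS})
    a.set("href", href)
    # Detach the trailing text first, because lxml moves an element's tail
    # along with the element when it is appended elsewhere.
    tail = span.tail
    span.tail = None
    # Put the link where the span currently sits, then move the span into it.
    parent.insert(parent.index(span), a)
    a.append(span)
    # Reattach the trailing text after the link instead of inside it.
    a.tail = tail
    return a

# findall() returns a list, so the tree can be modified while looping.
for span in p.findall(f"{{{XHTML_NS}}}span"):
    if span.get("class") == "footnote":
        # Hypothetical href scheme, for illustration only.
        wrap_in_link(span, "#fn" + span.text)

print(etree.tostring(p, encoding="unicode"))
```

After this runs, each a element contains only its span, and "More text." remains in the paragraph as the tail of the second link.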