
Why does this element in lxml include the tail?

Tags: python, html, lxml

Consider this Python script:

from lxml import etree

html = '''
<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
  <body>
    <p>This is some text followed with 2 citations.<span class="footnote">1</span>
       <span class="footnote">2</span>This is some more text.</p>
  </body>
</html>'''

tree = etree.fromstring(html)

for element in tree.findall(".//{*}span"):
    if element.get("class") == 'footnote':
        print(etree.tostring(element, encoding="unicode", pretty_print=True))

The desired output is just the two span elements, but instead I get:

<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">1</span>
<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">2</span>This is some more text.

Why does it include the text after the element until the end of the parent element?

I'm trying to use lxml to link footnotes. When I a.insert() the span element into the a element I create for it, the text after the span comes along with it, so large amounts of text that I don't want linked end up inside the link.

asked Nov 22 '13 13:11 by jorbas


2 Answers

Passing with_tail=False to etree.tostring() leaves the tail text out of the serialized output.

print(etree.tostring(element, encoding="unicode", pretty_print=True, with_tail=False))

See the lxml.etree.tostring documentation.

answered Oct 07 '22 21:10 by falsetru


It includes the text after the element because, in lxml's data model, that text belongs to the element: whatever follows an element's closing tag, up to the next tag, is stored in that element's tail attribute.

If you don't want that text to belong to the preceding span, it needs to be contained in its own element. However, you can avoid printing it when converting the element back to XML by passing with_tail=False to etree.tostring().

You can also simply set the element's tail to '' if you want to remove it from a specific element. Keep in mind that this drops the text from the tree itself, not just from the printed output.
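For the linking use case in the question, one possible approach is to hand the tail over to the new a element before moving the span into it, so the trailing text stays in the paragraph but outside the link. The sketch below is only an assumption about what the asker's code needs: wrap_in_link and the "#fn2" target are hypothetical names, and the markup is simplified (no XHTML namespace).

```python
from lxml import etree

def wrap_in_link(span, href):
    """Hypothetical helper: wrap `span` in a new <a>, keeping its tail outside the link."""
    parent = span.getparent()
    link = etree.Element("a", href=href)
    # Hand the tail over to the link so the text keeps its position in the parent.
    link.tail = span.tail
    span.tail = None
    # Place the link where the span currently sits, then move the span inside it.
    parent.insert(parent.index(span), link)
    link.append(span)
    return link

# Simplified version of the question's markup (namespace omitted for brevity).
p = etree.fromstring(
    '<p>Some text.<span class="footnote">1</span>'
    '<span class="footnote">2</span>More text.</p>'
)
second = p.findall("span")[1]
wrap_in_link(second, "#fn2")  # "#fn2" is a made-up link target

result = etree.tostring(p, encoding="unicode")
print(result)
```

With this input, "More text." ends up as the tail of the a element, so it follows the link instead of sitting inside it. In a namespaced document such as the one in the question, the new a element would presumably need to be created in the same namespace.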

answered Oct 07 '22 20:10 by Lennart Regebro