Fixing tostring() in Python's lxml

Tags: python, xml, lxml

lxml's tostring() function seems quite broken when printing only parts of documents. Witness:

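from lxml.html import fragment_fromstring, tostring

# Parse an HTML fragment; the text after </em> becomes the em element's tail.
frag = fragment_fromstring('<p>This stuff is <em>really</em> great!')
em = frag.cssselect('em').pop(0)
print(tostring(em))  # serializes the element together with its tail text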

I expect <em>really</em> but instead it prints <em>really</em> great!, which is wrong. The ' great!' is not part of the selected em. It's not only wrong, it's a pill, at least for processing document-structured XML, where such trailing text will be common.

As I understand it, lxml stores any free text that comes after the current element in the element's .tail attribute. A scan of the code for tostring() brings me to ElementTree.py's _write() function, which clearly always prints the tail. That's correct behavior for whole trees, but not on the last element when rendering a subtree, yet it makes no distinction.
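To make the text/tail model concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made only so the snippet runs without extra packages; lxml.etree follows the same ElementTree data model), and the markup is the fragment from above with the p closed so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# Same markup as the example above, with <p> closed so it is well-formed XML.
p = ET.fromstring("<p>This stuff is <em>really</em> great!</p>")
em = p.find("em")

print(repr(p.text))   # text before the first child of <p>
print(repr(em.text))  # text inside <em>
print(repr(em.tail))  # text after </em>, stored on the <em> element itself
```

Because the trailing text lives on em rather than on p, a serializer that is handed only em has to decide whether to emit it.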

To get a proper tail-free rendering of the selected XML, I tried writing a toxml() function from scratch to use in its place. It basically worked, but there are many special cases in handling comments, processing instructions, namespaces, encodings, yadda yadda. So I changed gears and now just piggyback tostring(), post-processing its output to remove the offending .tail text:

def toxml(e):
    """ Replacement for lxml's tostring() method that doesn't add spurious
    tail text. """

    from lxml.etree import tostring
    xml = tostring(e)
    if e.tail:
        xml = xml[:-len(e.tail)]
    return xml

A basic series of tests shows this works nicely.

Critiques and/or suggestions?

436

asked Jan 05 '11 22:01

Jonathan Eunice

1 Answer

How about xml = lxml.etree.tostring(e, with_tail=False)?

from lxml.html import fragment_fromstring
from lxml.etree import tostring

frag = fragment_fromstring('<p>This stuff is <em>really</em> great!')
em = frag.cssselect('em').pop(0)
print(tostring(em, with_tail=False))  # with_tail=False omits the trailing ' great!'

Looks like with_tail was added in v2.0; do you have an older version?

143

answered Oct 19 '22 17:10

kindall
