lxml's tostring() function seems quite broken when printing only parts of documents. Witness:
from lxml.html import fragment_fromstring, tostring
frag = fragment_fromstring('<p>This stuff is <em>really</em> great!')
em = frag.cssselect('em').pop(0)
print(tostring(em))
I expect <em>really</em>, but instead it prints <em>really</em> great!, which is wrong. The ' great!' is not part of the selected em element. It's not only wrong, it's a pill, at least for processing document-structured XML, where such trailing text will be common.
As I understand it, lxml stores any free text that comes after the current element in the element's .tail attribute. A scan of the code for tostring() brings me to ElementTree.py's _write() function, which clearly always prints the tail. That's correct behavior for whole trees, but not for the outermost element when rendering a subtree, yet it makes no distinction.
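To make the .tail model concrete, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree rather than lxml itself, since both follow the same .text/.tail layout; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Parse a well-formed version of the fragment from the question.
root = ET.fromstring("<p>This stuff is <em>really</em> great!</p>")
em = root.find("em")

# Text before the first child lives on the parent's .text;
# text after a child's closing tag lives on that child's .tail.
print(repr(root.text))
print(repr(em.text))
print(repr(em.tail))

# Serializing the subelement carries its tail along with it.
serialized = ET.tostring(em, encoding="unicode")
print(serialized)
```

So the trailing text is attached to the em element itself, which is why a serializer that walks the element also emits it.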
To get a proper tail-free rendering of the selected XML, I tried writing a toxml() function from scratch to use in its place. It basically worked, but there are many special cases in handling comments, processing instructions, namespaces, encodings, yadda yadda. So I changed gears and now just piggyback on tostring(), post-processing its output to remove the offending .tail text:
def toxml(e):
    """Replacement for lxml's tostring() method that doesn't add spurious
    tail text."""
    from lxml.etree import tostring
    xml = tostring(e)
    if e.tail:
        xml = xml[:-len(e.tail)]
    return xml
A basic series of tests shows this works nicely.
Critiques and/or suggestions?
How about xml = lxml.etree.tostring(e, with_tail=False)?
from lxml.html import fragment_fromstring
from lxml.etree import tostring
frag = fragment_fromstring('<p>This stuff is <em>really</em> great!')
em = frag.cssselect('em').pop(0)
print(tostring(em, with_tail=False))
Looks like with_tail was added in v2.0; do you have an older version?