
lxml: difference between Element addnext() and insert() in handling tail

Tags: python, lxml

Given an lxml Element xml I iterate over all of its children c[0..n] by calling c.getnext(). That is because I need to insert children on the fly if necessary, and I can't do so using an iterator. All elements have both text and tail set.

Let me illustrate the different behavior of addnext() and insert() with the following example. Assume a simple XML string that I parse into an lxml tree and then, just for sanity's sake, inspect:

>>> import lxml.etree
>>> s = "<p>This is <b>bold</b> and this is italic text.</p>"
# Create a new lxml element.
>>> xml = lxml.etree.fromstring(s)
# Let's look at the element, its child, and all the texts and tails.
>>> lxml.etree.tostring(xml)
b'<p>This is <b>bold</b> and this is italic text.</p>'
>>> xml.text
'This is '
>>> xml.tail
>>> xml[0].text
'bold'
>>> xml[0].tail
' and this is italic text.'

So far so good, and exactly what I would have expected (for more on the lxml representation see here).
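
For reference, the text/tail layout inspected above can be traced with a small sketch. It uses the standard library's xml.etree.ElementTree, which exposes the same .text and .tail attributes as lxml; the helper name text_pieces is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def text_pieces(elem):
    """Yield (owner_tag, attribute, string) triples in document order."""
    if elem.text:
        yield (elem.tag, "text", elem.text)
    for child in elem:
        # A child's own content comes first, then the text that follows it.
        yield from text_pieces(child)
        if child.tail:
            yield (child.tag, "tail", child.tail)

p = ET.fromstring("<p>This is <b>bold</b> and this is italic text.</p>")
pieces = list(text_pieces(p))
for owner, attr, string in pieces:
    print(f"{owner}.{attr} = {string!r}")
```

The text after </b> is stored on the <b> element as its tail, not on <p>, which is why any operation that moves <b> or inserts next to it has to decide what happens to that string.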

Now I want to wrap the word "italic" into tags, just like "bold" is wrapped into <b> tags. To do that, I first find the index at which the "italic" substring starts:

# Find the index of the "italic" substring.
>>> idx = xml[0].tail.find("italic")
>>> idx
13

Then I create a new lxml element:

# Create a new element and inspect it.
>>> new_c = lxml.etree.fromstring("<i>italic</i>")
>>> new_c.text
'italic'
>>> new_c.tail
>>>

To insert this new element into the xml tree properly, I have to split the original xml[0].tail string into two substrings and remove the "italic" from it:

>>> new_c.tail = xml[0].tail[idx+len("italic"):]
>>> xml[0].tail = xml[0].tail[:idx]
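
The order of those two assignments matters: new_c.tail has to be computed before xml[0].tail is truncated. A slightly more defensive way to do the same split is sketched below. The helper split_tail is hypothetical, not part of lxml; it relies only on str.partition and returns both halves at once.

```python
def split_tail(tail, word):
    """Split tail around the first occurrence of word.

    Returns (before, after), or None if tail is empty/None
    or word does not occur in it.
    """
    if not tail:
        return None
    before, found, after = tail.partition(word)
    if not found:
        return None
    return before, after

halves = split_tail(" and this is italic text.", "italic")
missing = split_tail(" and this is italic text.", "underlined")
print(halves)
print(missing)
```

With this, the two tails could be assigned in one step, e.g. `xml[0].tail, new_c.tail = halves`.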

Now everything is set up to insert that new element into the xml element, and this is what puzzles me. Inserting the new child new_c after a given child xml[0] gives different results depending on the method used, and the Element API documentation doesn't tell me anything more:

# Adds the element as a following sibling directly after this element.
# Note that tail text is automatically discarded when adding at the root level.
>>> xml[0].addnext(new_c)
>>> lxml.etree.tostring(xml)
b'<p>This is <b>bold</b><i>italic</i> text. and this is </p>'

and (starting over from the same prepared tree)

# Inserts a subelement at the given position in this element
>>> xml.insert(1 + xml.index(xml[0]), new_c)
>>> lxml.etree.tostring(xml)
b'<p>This is <b>bold</b> and this is <i>italic</i> text.</p>'

The two calls seem to handle tail differently (see the comment on addnext() regarding tail). Even taking the comment into account, the tail text is not discarded from <b> but ends up appended to the tail of <i>, and the root level is not handled any differently from levels further down (i.e. the exact same behavior can be observed by wrapping the original XML in s in an additional <foo> tag).

What am I missing here?

EDIT A related discussion on the lxml mailing list is here.

asked Jun 06 '26 16:06 by Jens


2 Answers

elem.addnext(nextelem) operates at the XML level, i.e. it adds the new element directly after the element itself, so any tail text of elem ends up behind the newly inserted element. This is done to make the new element a directly following sibling.

parent.insert(where, elem) works exactly as if the parent element were just a list of etree.Element objects. It puts a new element into the list without making any changes to the etree.Element instances, so each one keeps its own tail. parent.append(elem) works the same way, as does any other list manipulation.
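
A minimal sketch of that list-like view, using the standard library's xml.etree.ElementTree rather than lxml (the assumption being that lxml's insert() treats tails the same way, as described above). The element names mirror the example further down.

```python
import xml.etree.ElementTree as ET

x = ET.fromstring("<a>foo<b/>bar</a>")
y = ET.fromstring("<c>C!</c>")
y.tail = "more..."

b = x.find("b")
# List-style insertion: put <c> into the slot right after <b>.
x.insert(list(x).index(b) + 1, y)

tags = [child.tag for child in x]
tails = [child.tail for child in x]
flat = "".join(x.itertext())
print(tags, tails, flat)
```

Each element keeps the tail it had before the call: 'bar' still belongs to <b> and 'more...' to <c>, so the new element lands after <b> and its tail text.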

So, these functions have two different views on the element tree.

>>> from lxml import etree as et
>>> 
>>> x = et.XML('<a>foo<b/>bar</a>')
>>> y = et.XML('<c>C!</c>')
>>> 
>>> et.dump(x)
<a>foo<b/>bar</a>
>>> x.find('b').addnext(y)
>>> et.dump(x)
<a>foo<b/><c>C!</c>bar</a>

The tail moves from the b element to the c element, to keep the XML document the same except for the inserted element.

Now, if the inserted Element already has a tail, addnext inserts the Element together with the text following it, directly after the XML element, not after the etree Element-with-tail.

>>> x = et.XML('<a>foo<b/>bar</a>')
>>> y = et.XML('<c>C!</c>')
>>> y.tail = 'more...'
>>> 
>>> x.find('b').addnext(y)
>>> et.dump(x)
<a>foo<b/><c>C!</c>more...bar</a>
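
If addnext() is preferred but the new element should land after the existing tail text, one possible workaround is to detach both tails before the call and re-attach them afterwards. This is only a sketch: addnext_keep_tail is a hypothetical helper, not an lxml function, and it assumes lxml is installed.

```python
from lxml import etree

def addnext_keep_tail(elem, new_elem):
    """Insert new_elem after elem so that elem keeps its own tail text."""
    elem_tail, new_tail = elem.tail, new_elem.tail
    # Detach both tails so addnext() has no text left to move around.
    elem.tail = None
    new_elem.tail = None
    elem.addnext(new_elem)
    # Re-attach: each element gets back the tail it had before the call.
    elem.tail = elem_tail
    new_elem.tail = new_tail

p = etree.fromstring("<p>This is <b>bold</b> and this is italic text.</p>")
b = p[0]
i = etree.fromstring("<i>italic</i>")
before, _, after = b.tail.partition("italic")
b.tail, i.tail = before, after

addnext_keep_tail(b, i)
result = etree.tostring(p)
print(result)
```

Because no tail text is present while addnext() runs, the outcome does not depend on where addnext() would otherwise place it.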

answered Jun 08 '26 06:06 by jensq


tail only exists at lxml's level; in libxml2, it's a text node just as it is in DOM. The prime reason is convenience when parsing pretty-printed XML (http://lxml.de/tutorial.html#elements-contain-text):

The two properties .text and .tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs).

All lxml functions strive to maintain that abstraction AFAICS from the source. E.g. index() only counts elements/comments/entityrefs/PI nodes and tree manipulation routines appear to always move a node's tail along with it. However, since this concept

  • is so underdocumented
  • was tailored for XML where a user doesn't care about trailing text
  • conflicts with the regular representation

there appear to be inconsistencies in its application. This looks like one (and a bug if the consistency is a goal). I'd discuss the last statement with the maintainers to clarify the library's intended behaviour regarding tails.

answered Jun 08 '26 06:06 by ivan_pozdeev