xml.etree.ElementTree vs. lxml.etree: different internal node representation?

Tags:

I have been transforming some of my original xml.etree.ElementTree (ET) code to lxml.etree (lxmlET). Luckily there are a lot of similarities between the two. However, I did stumble upon some strange behaviour that I cannot find written down in any documentation. It considers the internal representation of descendant nodes.

In ET, iter() is used to iterate over all descendants of an Element, optionally filtered by tag name. Because I could not find any details about this in the documentation, I expected similar behaviour for lxmlET. The thing is that from testing I conclude that in lxmlET, there is a different internal representation of a tree.

In the example below, I iterate over nodes in a tree and print each node's children, but in addition I also create all different combinations of those children and print those. This means, if an element has children ('A', 'B', 'C') I create alterations, namely trees [('A'), ('A', 'B'), ('A', 'C'), ('B'), ('B', 'C'), ('C')].

# import lxml.etree as ET
import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy


def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            new_combo_tree = ET.Element(tree.tag, tree.attrib)
            for recombined_child in combination:
                new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            yield new_combo_tree

    return None


def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if not node.tag == 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')

            for combo_tree in get_combination_trees(subnode):
                combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))    

    return None


s = '''<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
'''

parse_xml(s)

The expected output here is the id's of the children of each node joined together with a hyphen, and also all possible combinations of the children (cf. supra) in a top-down breadth-first fashion.

2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children

However, when you use the lxml module instead of xml (uncomment the import for lxmlET and comment the import for ET), and run the code you'll see that the output is

2-3
2
3
node 2 has no children

So the deeper descendant nodes are never visited. This can be circumvented by either:

using deepcopy (comment/uncomment relevant part in get_combination_trees()), or
using for subnode in node.xpath('.//node') in parse_xml() instead of iter().

So I know that there is a way around this, but I am mainly wondering what is happening?! It took me ages to debug this, and I can't find any documentation on it. What is going on, what is the actual underlying difference here between the two modules? And what is the most efficient work-around when working with very large trees?

776

asked Jun 07 '18 20:06

Bram Vanroy

2 Answers

While Louis's answer is correct and I completely agree that modifying a data structure as you traverse it generally a Bad Idea^(tm), you also asked why the code works with xml.etree.ElementTree and not lxml.etree and there is a very reasonable explanation for that.

Implementation of `.append` in `xml.etree.ElementTree`

This library is implemented directly in Python and could vary depending on which Python runtime you're using. Assuming you're using CPython, the implementation you're looking for is implemented in vanilla Python:

def append(self, subelement):
    """Add *subelement* to the end of this element.
    The new element will appear in document order after the last existing
    subelement (or directly after the text, if it's the first subelement),
    but before the end tag for this element.
    """
    self._assert_is_element(subelement)
    self._children.append(subelement)

The last line is the only part we're concerned with. As it turns out, self._children is initialized towards the top of that file as:

self._children = []

So adding a child to a tree is just appending an element to a list. Intuitively, that's exactly what you're looking for (in this case) and the implementation behaves in a completely unsurprising way.

Implementation `.append` in `lxml.etree`

lxml is implemented as a mix of Python, non-trivial Cython, and C code so spelunking through it was significantly harder than the pure-Python implementation. First off, .append is implemented as:

def append(self, _Element element not None):
    u"""append(self, element)
    Adds a subelement to the end of this element.
    """
    _assertValidNode(self)
    _assertValidNode(element)
    _appendChild(self, element)

_appendChild is implemented over in apihelper.pxi:

cdef int _appendChild(_Element parent, _Element child) except -1:
    u"""Append a new child to a parent element.
    """
    c_node = child._c_node
    c_source_doc = c_node.doc
    # prevent cycles
    if _isAncestorOrSame(c_node, parent._c_node):
        raise ValueError("cannot append parent to itself")
    # store possible text node
    c_next = c_node.next
    # move node itself
    tree.xmlUnlinkNode(c_node)
    tree.xmlAddChild(parent._c_node, c_node)
    _moveTail(c_next, c_node)
    # uh oh, elements may be pointing to different doc when
    # parent element has moved; change them too..
    moveNodeToDocument(parent._doc, c_source_doc, c_node)
    return 0

There's definitely a bit more going on here. In particular, lxml explicitly removes the node from the tree and then adds it elsewhere. This prevents you from accidentally creating a cyclic XML graph while manipulating nodes (which is something you could probably do with the xml.etree version).

Workarounds for `lxml`

Now that we know that xml.etree copies nodes when appending but lxml.etree moves them, why do those workarounds work? Based on the tree.xmlUnlinkNode method (which is actually defined in C inside of libxml2), unlinking just messes with a bunch of pointers. So, anything that copies node metadata will do the trick. Because all of the metadata we care about are direct fields on the xmlNode struct, anything that shallow copies nodes will do the trick

copy.deepcopy() definitely works
node.xpath returns nodes wrapped in proxy elements which happens to shallow copy the tree metadata
copy.copy() also does the trick
If you don't need your combinations to actually be in an official tree, setting new_combo_tree = [] also gives you list appending just like xml.etree.

If you're really concerned about performance and large trees, I'd probably start with shallow copying with copy.copy() although you should absolutely profile a few different options and see which one works best for you.

109

answered Sep 21 '22 12:09

supersam654

At the beginning I didn't think there was such a difference (neither did I check), but both @supersam654 and @Louis answers pinpointed it very clearly.

But code that is dependent on internal representation (rather than interface) of stuff that it uses, doesn't seem right (from design PoV) to me. Also, as I was asking in my comment: combo_children seems totally useless:

Get child nodes combo (as a list)
Append each node from the list as a child to combo_children
Return combo_children
Get combo_children children (as a list)
Use the list (combo)

when things could be easily done:

Get child nodes combo (as a list)
Return the list
Use the list (combo)

Apparently, the combo_children approach was also exposing the behavioral difference between the modules.

code_orig_lxml.py:

import lxml.etree as ET
#import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy


def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            #new_combo_tree = ET.Element(tree.tag, tree.attrib)
            #for recombined_child in combination:
                #new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            #yield new_combo_tree
            yield combination

    return None


def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if not node.tag == 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')

            #for combo_tree in get_combination_trees(subnode):
            for combo_children in get_combination_trees(subnode):
                #combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))

    return None


s = """
<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
"""

parse_xml(s)

Notes:

This is your code with the changes above
I didn't removed anything, instead just commented stuff (which would generate the smallest diff between the old and new versions)

Output:

(py36x86_test) e:\Work\Dev\StackOverflow\q050749937>"e:\Work\Dev\VEnvs\py36x86_test\Scripts\python.exe" code_orig_lxml.py
2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children

While I was investigating, I modified your code further, to:

Fix the issue
Improve printing
Make it modular
Use both parsing methods, to make differences between them clearer

xml_data.py:

DATA = """
<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
"""

code.py:

import sys
import xml.etree.ElementTree as xml_etree_et
import lxml.etree as lxml_etree
from itertools import combinations
from xml_data import DATA


MAIN_NODE_NAME = "node_main"


def get_children_combinations(tree):
    children = list(tree)
    for i in range(1, len(children)):
        yield from combinations(children, i)


def get_tree(xml_str, parse_func, tag=None):
    root_node = parse_func(xml_str)
    if tag:
        return [item for item in root_node if item.tag == tag]
    return [root_node]


def process_xml(xml_node):
    for node in xml_node.iter("node"):
        print(f"\nNode ({node.tag}, {node.attrib['id']})")
        children = list(node)
        if children:
            print("    Children: " + " - ".join([child.attrib["id"] for child in children]))

        for children_combo in get_children_combinations(node):
            if children_combo:
                print("    Combo: " + " - ".join([child.attrib["id"] for child in children_combo]))


def main():
    parse_funcs = (xml_etree_et.fromstring, lxml_etree.fromstring)
    for func in parse_funcs:
        print(f"\nParsing xml using: {func.__module__} {func.__name__}")
        nodes = get_tree(DATA, func, tag=MAIN_NODE_NAME)
        for node in nodes:
            print(f"\nProcessing node: {node.tag}")
            process_xml(node)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

Output:

(py36x86_test) e:\Work\Dev\StackOverflow\q050749937>"e:\Work\Dev\VEnvs\py36x86_test\Scripts\python.exe" code.py
Python 3.6.2 (v3.6.2:5fd33b5, Jul  8 2017, 04:14:34) [MSC v.1900 32 bit (Intel)] on win32


Parsing xml using: xml.etree.ElementTree XML

Processing node: node_main

Node (node, 1)
    Children: 2 - 3
    Combo: 2
    Combo: 3

Node (node, 2)

Node (node, 3)
    Children: 4 - 6
    Combo: 4
    Combo: 6

Node (node, 4)
    Children: 5

Node (node, 5)

Node (node, 6)

Parsing xml using: lxml.etree fromstring

Processing node: node_main

Node (node, 1)
    Children: 2 - 3
    Combo: 2
    Combo: 3

Node (node, 2)

Node (node, 3)
    Children: 4 - 6
    Combo: 4
    Combo: 6

Node (node, 4)
    Children: 5

Node (node, 5)

Node (node, 6)

answered Sep 21 '22 12:09

CristiFati

Related questions
                            
                                PyQt5 QuickControls Material style
                            
                                Permission denied when pandas dataframe to tempfile csv
                            
                                How to give a constant input to keras
                            
                                How to interpret the metrics of Locust?
                            
                                How does isinstance work for List?
                            
                                How does "import __hello__" work?
                            
                                How do I specify URL resolution in python's requests library in a similar fashion to curl's --resolve flag?
                            
                                json reading error json.decoder.JSONDecodeError: Invalid \escape
                            
                                How to sort bars in a bar plot in ascending order
                            
                                How do I call a class method from another file in Python?
                            
                                How to use F1 Score with Keras model?
                            
                                How can I prevent Sphinx from listing "object" as a base class?
                            
                                Python: How come a large float turns into *inf* automatically?
                            
                                PIL: draw transparent text on top of an image
                            
                                pyyaml load number as decimal
                            
                                How to find the argmax of a two dimensional array in numpy?
                            
                                open a .jpg image in python using matplotlib.image in python 3.6
                            
                                Python refactoring fails in Visual Studio Code
                            
                                Dask Dataframe: Get row count?
                            
                                Does anyone have the "Documentation URL"s in PyCharm for the following libraries:

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

xml.etree.ElementTree vs. lxml.etree: different internal node representation?

Tags:

python

python-3.x

xml

lxml

elementtree

Bram Vanroy

People also ask

2 Answers

Implementation of `.append` in `xml.etree.ElementTree`

Implementation `.append` in `lxml.etree`

Workarounds for `lxml`

supersam654

CristiFati

Recent Activity

Donate For Us

xml.etree.ElementTree vs. lxml.etree: different internal node representation?

Tags:

python

python-3.x

xml

lxml

elementtree

Bram Vanroy

People also ask

2 Answers

Implementation of .append in xml.etree.ElementTree

Implementation .append in lxml.etree

Workarounds for lxml

supersam654

CristiFati

Related questions

Recent Activity

Donate For Us

Implementation of `.append` in `xml.etree.ElementTree`

Implementation `.append` in `lxml.etree`

Workarounds for `lxml`