
Why does lxml.etree.SubElement() allow making elements which are not serialisable?

from lxml import etree

element1 = etree.Element('{j:a}a', nsmap={None: 'j:a'})
etree.SubElement(element1, 'b')

element2 = etree.Element('{j:a}a', nsmap={None: 'j:a'})
etree.SubElement(element2, '{j:a}b')

Both elements serialise to the same string:

<a xmlns="j:a"><b/></a>

but the two elements do not behave the same:

element1.find('b') -> returns the Element

element2.find('b') -> returns None

If you do it the other way around and parse the string

etree.fromstring('<a xmlns="j:a"><b/></a>')

you get the same representation as element2, so on the parsed element

find('b') -> returns None

which seems consistent: there is no namespaceless <b/> in the tree, because <b/> inherits the default namespace from <a/>.

So what is the purpose of the representation in element1? It seems to add a namespaceless subelement <b/>, and it behaves that way. But when serialised, the element inherits the namespace from <a/>.

Why does this representation exist if it does not survive serialisation anyway?

lovetox asked Nov 07 '22 00:11


1 Answer

It all comes down to namespaces

XML elements can (but need not) have a namespace. So even if the root element declares a default namespace, child elements in the tree are allowed to have no namespace, which is not the same as being in the default namespace.

This is the difference between your element1 and element2: element1's subelement has no namespace; element2's subelement is in the default namespace, because you spelled out that namespace in the tag when you created it. If you try

element2.find("{j:a}b") -> returns the element b, or to be more accurate, the element {j:a}b.

So yes, namespace matters. And when you create the elements with lxml, you can define elements without namespace: just don't add it.
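
The difference can be checked directly on the two trees from the question. This is only a minimal sketch: the variable names and the prefix "n" are made up for illustration, and the namespace URI is the one used in the question.

```python
from lxml import etree

NS = "j:a"  # the namespace URI used in the question

# Child created with a bare tag: it is in no namespace at all.
element1 = etree.Element(f"{{{NS}}}a", nsmap={None: NS})
child1 = etree.SubElement(element1, "b")

# Child created with the namespace spelled out in Clark notation.
element2 = etree.Element(f"{{{NS}}}a", nsmap={None: NS})
child2 = etree.SubElement(element2, f"{{{NS}}}b")

print(child1.tag)  # b
print(child2.tag)  # {j:a}b

# find() matches on the qualified tag, so each tree needs a different path.
found_plain = element1.find("b")
found_qualified = element2.find(f"{{{NS}}}b")
missing = element2.find("b")  # None: there is no namespaceless b here

# A prefix mapping avoids repeating the URI; the prefix "n" is arbitrary.
found_prefixed = element2.find("n:b", namespaces={"n": NS})
```

The prefix passed to find() only exists for the lookup; it does not have to match anything declared in the document.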

But what about serialization?

Now, I am not an lxml expert, so this is just my guess. The thing is, when lxml serializes the element it does not distinguish between elements that really have no namespace and elements in the default namespace, so both are written out the same way.

Consequently, serializing an element and then parsing it again does not necessarily give back the original tree. If, for example, using your element1 you do:

sel1 = etree.tostring(element1)
element1s = etree.fromstring(sel1)

It turns out that element1s is not equivalent to element1, because the subelement b has now become the subelement {j:a}b. When the string is parsed, unprefixed elements are placed in the default namespace.
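
A small sketch of that round trip, comparing the child tags before and after. With the serialization described in the question, the tag after reparsing would be {j:a}b, but the output is deliberately not hard-coded here, since it depends on how the installed lxml version writes the element out. The variable names are illustrative only.

```python
from lxml import etree

element1 = etree.Element("{j:a}a", nsmap={None: "j:a"})
etree.SubElement(element1, "b")  # bare tag, no namespace

# Serialise, then parse the result again.
serialised = etree.tostring(element1)
reparsed = etree.fromstring(serialised)

tags_before = [child.tag for child in element1]
tags_after = [child.tag for child in reparsed]

print(serialised)
print(tags_before)  # ['b']
print(tags_after)   # compare this against tags_before

# True only if the namespaceless child survived the round trip.
round_trip_preserved = tags_before == tags_after
print(round_trip_preserved)
```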

Conclusion

Now, I don't know whether this is intended or a bug. To the best of my knowledge, if an XML document declares a default namespace, every unprefixed element within its scope is in the default namespace, which is exactly what happens when you parse an XML document with the fromstring function. An element can be in no namespace only if no default namespace is in scope, either because none was declared or because it was reset with xmlns="".
So in my opinion the b subelement of your element1 should "inherit" the namespace of the parent node, since the parent node defines a default namespace with nsmap={None: "j:a"}.
But one could also argue that, since you are building the document from lxml elements, it is your responsibility to put each element in the correct namespace, which means you have to add the default namespace explicitly.
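
The parsing rule can be checked directly, including the case where a child resets the default namespace with xmlns="" (which the Namespaces in XML recommendation permits). The sample document below is made up for illustration; only the namespace URI comes from the question.

```python
from lxml import etree

# <b/> is unprefixed and inherits the default namespace;
# <c xmlns=""/> undeclares it and ends up in no namespace.
xml = b'<a xmlns="j:a"><b/><c xmlns=""/></a>'
root = etree.fromstring(xml)

tags = [child.tag for child in root]
print(tags)  # ['{j:a}b', 'c']
```

So a namespaceless child under a parent with a default namespace does have a textual form in XML; it simply has to carry the xmlns="" reset.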

Since XML allows elements without a namespace under some circumstances, lxml does not complain when an element does not have one.
I think that automatically adding the default namespace to subelements of an element that declares one would be a cool feature, but it's just not there.
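
Until then, the namespace can be added by hand. A minimal sketch, assuming a hypothetical helper named sub_element_in_default_ns (not part of lxml) that looks up the default namespace in the parent's nsmap and builds the qualified tag:

```python
from lxml import etree


def sub_element_in_default_ns(parent, local_name, **attrib):
    """Hypothetical helper: create a child in the default namespace in scope at parent."""
    # nsmap maps prefixes to URIs; the key None holds the default namespace, if any.
    default_ns = parent.nsmap.get(None)
    tag = f"{{{default_ns}}}{local_name}" if default_ns else local_name
    return etree.SubElement(parent, tag, **attrib)


root = etree.Element("{j:a}a", nsmap={None: "j:a"})
child = sub_element_in_default_ns(root, "b")
print(child.tag)  # {j:a}b

# Without a default namespace in scope, the helper falls back to the bare tag.
plain_root = etree.Element("a")
plain_child = sub_element_in_default_ns(plain_root, "b")
print(plain_child.tag)  # b
```

Children built this way match what fromstring produces for the same document, so find() behaves the same on a built tree and a parsed one.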

Valentino answered Nov 12 '22 21:11