I want to test whether two XML elements are equivalent. Comparing the tostring() output of the elements works, but that seems hacky.
Is there a better way to test equivalence of two etree Elements?
Comparing Elements directly:
import xml.etree.ElementTree as etree

h1 = etree.Element('hat', {'color': 'red'})
h2 = etree.Element('hat', {'color': 'red'})
h1 == h2  # False
Comparing Elements as strings:
etree.tostring(h1) == etree.tostring(h2)  # True
ElementTree is a Python library for parsing and navigating XML documents. It represents the document as a tree structure that is easy to work with.
Parsing from strings and files: lxml.etree supports parsing XML in a number of ways and from all important sources, namely strings, files, URLs (http/ftp) and file-like objects. The main parse functions are fromstring() and parse(), both called with the source as first argument.
This compare function works for me:
def elements_equal(e1, e2):
    if e1.tag != e2.tag:
        return False
    if e1.text != e2.text:
        return False
    if e1.tail != e2.tail:
        return False
    if e1.attrib != e2.attrib:
        return False
    if len(e1) != len(e2):
        return False
    return all(elements_equal(c1, c2) for c1, c2 in zip(e1, e2))
Comparing strings doesn't always work. The order of the attributes should not matter when deciding whether two nodes are equivalent, but a string comparison is sensitive to that order.
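If a string comparison is still preferred, one way around the attribute-order problem is to canonicalize both documents first. This is only a sketch, and it assumes Python 3.8 or later, where the standard library provides xml.etree.ElementTree.canonicalize() (C14N 2.0), which writes attributes in a fixed sorted order:

```python
import xml.etree.ElementTree as ET

a = '<hat color="blue" price="39.90"/>'
b = '<hat price="39.90" color="blue"/>'

# The raw strings differ only in attribute order.
same_raw = (a == b)

# canonicalize() returns the C14N 2.0 form as a string; attributes are
# emitted in sorted order, so the source order no longer matters.
same_canonical = (ET.canonicalize(a) == ET.canonicalize(b))

print(same_raw, same_canonical)
```

Canonicalization does not normalize whitespace in text content by default, so differences in indentation would still show up.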
I'm not sure if it is a problem or a feature, but my version of lxml.etree preserves the order of the attributes if they are parsed from a file or a string:
>>> from lxml import etree
>>> h1 = etree.XML('<hat color="blue" price="39.90"/>')
>>> h2 = etree.XML('<hat price="39.90" color="blue"/>')
>>> etree.tostring(h1) == etree.tostring(h2)
False
This might be version-dependent (I use Python 2.7.3 with lxml.etree 2.3.2 on Ubuntu); I remember that I couldn't find a way of controlling the order of the attributes a year ago or so, when I wanted to (for readability reasons).
As I need to compare XML files that were produced by different serializers, I see no other way than recursively comparing tag, text, attributes, and children of every node. And of course tail, if there's anything interesting there.
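When comparing files from different serializers, it can help to know where the trees diverge, not just that they do. The following is a rough sketch of such a recursive comparison; first_difference is a hypothetical helper name, the path format is my own choice, and it uses f-strings, so it needs Python 3.6 or later:

```python
import xml.etree.ElementTree as ET

def first_difference(e1, e2, path=""):
    """Return a description of the first mismatch, or None if the trees match."""
    here = f"{path}/{e1.tag}"
    if e1.tag != e2.tag:
        return f"{path or '/'}: tag {e1.tag!r} != {e2.tag!r}"
    if e1.attrib != e2.attrib:  # dict comparison, attribute order is ignored
        return f"{here}: attributes differ"
    if e1.text != e2.text:
        return f"{here}: text differs"
    if e1.tail != e2.tail:
        return f"{here}: tail differs"
    if len(e1) != len(e2):
        return f"{here}: child count {len(e1)} != {len(e2)}"
    for c1, c2 in zip(e1, e2):
        diff = first_difference(c1, c2, here)
        if diff is not None:
            return diff
    return None

doc1 = ET.fromstring('<box><hat color="blue" price="39.90"/></box>')
doc2 = ET.fromstring('<box><hat price="39.90" color="blue"/></box>')
doc3 = ET.fromstring('<box><hat color="red" price="39.90"/></box>')

print(first_difference(doc1, doc2))  # attribute order only
print(first_difference(doc1, doc3))  # different attribute value
```

For real files, the roots would come from ET.parse(filename).getroot() instead of ET.fromstring().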
Comparison of lxml and xml.etree.ElementTree
The truth is that it may be implementation-dependent. Apparently, lxml uses an ordered dict or something like that, whereas the standard xml.etree.ElementTree (in the Python 2.7 sessions shown here) does not preserve the order of attributes:
Python 2.7.1 (r271:86832, Nov 27 2010, 17:19:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> h1 = etree.XML('<hat color="blue" price="39.90"/>')
>>> h2 = etree.XML('<hat price="39.90" color="blue"/>')
>>> etree.tostring(h1) == etree.tostring(h2)
False
>>> etree.tostring(h1)
'<hat color="blue" price="39.90"/>'
>>> etree.tostring(h2)
'<hat price="39.90" color="blue"/>'
>>> etree.dump(h1)
<hat color="blue" price="39.90"/>>>> etree.dump(h2)
<hat price="39.90" color="blue"/>>>>
(Yes, the newlines are missing. But it is a minor problem.)
>>> import xml.etree.ElementTree as ET
>>> h1 = ET.XML('<hat color="blue" price="39.90"/>')
>>> h1
<Element 'hat' at 0x2858978>
>>> h2 = ET.XML('<hat price="39.90" color="blue"/>')
>>> ET.dump(h1)
<hat color="blue" price="39.90" />
>>> ET.dump(h2)
<hat color="blue" price="39.90" />
>>> ET.tostring(h1) == ET.tostring(h2)
True
>>> ET.dump(h1) == ET.dump(h2)
<hat color="blue" price="39.90" />
<hat color="blue" price="39.90" />
True
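Because serialized output depends on the library and its version (as far as I know, xml.etree.ElementTree has preserved attribute order when serializing since Python 3.8, so the True above should not be relied on there), a comparison that does not depend on ordering is safer. A small sketch, assuming the standard xml.etree.ElementTree, where attrib is a plain dict:

```python
import xml.etree.ElementTree as ET

h1 = ET.XML('<hat color="blue" price="39.90"/>')
h2 = ET.XML('<hat price="39.90" color="blue"/>')

# attrib is a plain dict here, and dict equality ignores key order.
attribs_equal = (h1.attrib == h2.attrib)
print(attribs_equal)

# The serialized forms may or may not match, depending on the Python version.
print(ET.tostring(h1) == ET.tostring(h2))
```

The first comparison holds regardless of the version; only the second one varies.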
Another question is what should be considered unimportant when comparing. For example, some fragments may contain extra spaces that we do not want to care about. For that reason, it is usually better to write a serializing or comparison function that does exactly what we need.