 

How to check if two XML files are equivalent with Python?

Tags:

python

xml

How to check if two XML files are equivalent?

For example, the two XML files below should count as the same even though the ordering is different. I need to check whether the two XML files contain the same textual info, disregarding the order.

<a>
   <b>hello</b>
   <c><d>world</d></c>
</a>

<a>
   <c><d>world</d></c>
   <b>hello</b>
</a>

Are there tools for this out there?

prosseek asked Oct 20 '10 12:10



2 Answers

It all depends on your definition of "equivalent".

Assuming you really only care about the text nodes (for example, the d tags in your example do not even matter; you only care about the text "world" they contain), you can build a set of the text nodes of each document and compare the sets. Using lxml, this could look like:

from lxml import etree

tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')

print(set(tree1.getroot().itertext()) == set(tree2.getroot().itertext()))

You might even want to ignore whitespace nodes, doing something like:

set(i for i in tree.getroot().itertext() if i.strip())

Note that using sets means you will NOT take into account how many times a given piece of text occurs in the document (this may or may not be what you want). If the order is not important but the number of occurrences is, you could use a dictionary instead of a set and keep track of the counts (e.g. with collections.defaultdict, or with collections.Counter, available since Python 2.7).
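As a rough illustration of the counting idea, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree rather than lxml (both provide itertext()), and the helper name text_counts and the inline sample documents are made up for this example:

```python
from collections import Counter
import xml.etree.ElementTree as ET

def text_counts(xml_string):
    """Count each non-whitespace text node, ignoring element order."""
    root = ET.fromstring(xml_string)
    # Strip surrounding whitespace and skip whitespace-only nodes.
    return Counter(t.strip() for t in root.itertext() if t.strip())

# Hypothetical sample documents: doc2 reorders doc1, doc3 repeats "hello".
doc1 = "<a><b>hello</b><c><d>world</d></c></a>"
doc2 = "<a><c><d>world</d></c><b>hello</b></a>"
doc3 = "<a><b>hello</b><b>hello</b><c><d>world</d></c></a>"

same_12 = text_counts(doc1) == text_counts(doc2)
same_13 = text_counts(doc1) == text_counts(doc3)
print(same_12, same_13)  # True False
```

A plain set comparison would treat doc1 and doc3 as equal, because the repeated "hello" collapses into a single set member; the Counter keeps the two apart.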

But if it is only the order of the direct child elements of the root element (in your case, the children of the a element) that may be ignored, and everything inside those elements really counts, you would need another approach. You could, for example, apply XML canonicalization (C14N) to each child element to get a normalized version of each child (again, I don't know whether this is normalized enough for your needs).

from lxml import etree

tree1 = etree.parse('example1.xml')
tree2 = etree.parse('example2.xml')

set1 = set(etree.tostring(i, method='c14n') for i in tree1.getroot())
set2 = set(etree.tostring(i, method='c14n') for i in tree2.getroot())

print(set1 == set2)

Note: to keep the example simpler, I've used the development version of lxml. In older versions, there is no method='c14n' for etree.tostring(), only a c14n() method on the ElementTree that writes to a file-like object. To get it working there, you'd have to copy each element to a tree of its own and use a StringIO() object as a dummy file.

Also, this way of doing it is probably not recommended with very large files.

But again, a BIG WARNING: you really have to know what you mean by "equivalent", and create your own solution based on that knowledge!

Steven answered Nov 03 '22 09:11



Ordering is important in XML, so the two files you provided are different. Normally you could normalize the XML and then simply compare the files as text, but if you want order-insensitive comparison, you will probably have to implement it yourself using one of the bazillion XML parsers out there (I would recommend lxml, by the way).
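One possible way to implement such an order-insensitive comparison yourself is sketched below. It is only one definition of "equivalent": it assumes that tail text, comments, and whitespace around text can be ignored, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies. The names canonical_key and unordered_equal are invented for this example:

```python
import xml.etree.ElementTree as ET

def canonical_key(elem):
    """Return a nested tuple describing elem, with its children sorted.

    Assumption: tail text is ignored and element text is stripped.
    """
    text = (elem.text or "").strip()
    attrs = tuple(sorted(elem.attrib.items()))
    # Sorting the children's keys makes sibling order irrelevant at every level.
    children = tuple(sorted(canonical_key(child) for child in elem))
    return (elem.tag, attrs, text, children)

def unordered_equal(xml1, xml2):
    """Compare two XML strings while ignoring the order of sibling elements."""
    return canonical_key(ET.fromstring(xml1)) == canonical_key(ET.fromstring(xml2))

first = "<a><b>hello</b><c><d>world</d></c></a>"
second = "<a><c><d>world</d></c><b>hello</b></a>"
print(unordered_equal(first, second))  # True
```

Unlike a comparison of text nodes alone, this keeps the element names and nesting significant, so moving "world" out of the d element would make the documents differ.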

Gintautas Miliauskas answered Nov 03 '22 09:11
