Efficiently compare two XML files in Python

Question

I'm trying to find an efficient way to compare two XML files and handle the differences in a Python script. I have two XML files similar to the following one:

<?xml version="1.0" encoding="UTF-8"?>
<garage>
    <car>
        <color>red</color>
        <size>big</size>
        <price>10000</price>
    </car>
    <car>
        <color>blue</color>
        <size>big</size>
        <price>10000</price>
    </car>

    <!-- [...] -->

    <car>
        <color>red</color>
        <size>big</size>
        <price>11000</price>
    </car>
</garage>

Those XML files contain thousands of small objects and are about 5 MB each. The tricky part is that only very few entries differ between the two files, and I only need to handle the entries that differ. In other words, I need to efficiently (!) find out which entries have changed or been added. Unfortunately, the XML files also contain some optional entries that I don't care about at all.

I considered the following solutions:

Parse both files into a DOM tree and compare them in a loop
Parse both files into sets and use operators like set.difference
Hand some of the processing over to Linux tools like grep and diff
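
As a rough illustration of the second option, here is a minimal sketch that reduces every <car> to a tuple and uses a set difference. The FIELDS tuple and the car_records helper are hypothetical names chosen for this example, and the two inline documents are made-up sample data. Note that a set collapses identical entries, so this only works if duplicates don't matter.

```python
import xml.etree.ElementTree as ET

# Fields to compare; everything else (the optional entries) is ignored.
# This tuple is an assumption based on the sample file above.
FIELDS = ("color", "size", "price")


def car_records(xml_text):
    """Return a set with one tuple of the relevant field values per <car>."""
    root = ET.fromstring(xml_text)
    return {
        tuple(car.findtext(field, default="") for field in FIELDS)
        for car in root.iter("car")
    }


old = car_records(
    "<garage>"
    "<car><color>red</color><size>big</size><price>10000</price></car>"
    "<car><color>blue</color><size>big</size><price>10000</price></car>"
    "</garage>"
)
new = car_records(
    "<garage>"
    "<car><color>red</color><size>big</size><price>10000</price><note>x</note></car>"
    "<car><color>blue</color><size>big</size><price>12000</price></car>"
    "</garage>"
)

# Entries that are in the new file but not in the old one.
changed_or_added = new - old
```

The optional <note> element does not affect the result because it is never read into the tuple.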

Does anybody here have experience with the performance of these approaches and can point me in the right direction?

Preet Kukreti · Accepted Answer

Create a cached intermediate format that contains only the data you care about comparing. When comparing two files, A.xml and B.xml, compare A.cached and B.cached instead, generating them if they are missing and deleting them when the source file changes (or regenerating them based on timestamps, etc.). The generation cost is amortized over multiple comparisons, and you never iterate over entries you don't need.
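
A minimal sketch of this idea, assuming the entries are <car> elements and that only color, size and price matter. The ".cached" file here is plain JSON, and the function names (extract, load_cached, changed_or_added) are made up for illustration. Staleness is checked by comparing modification times.

```python
import json
import os
import xml.etree.ElementTree as ET

FIELDS = ("color", "size", "price")  # assumed fields of interest


def extract(xml_path):
    """Parse the XML once and keep only the compared fields of each <car>."""
    records = []
    for _, elem in ET.iterparse(xml_path):  # default: "end" events only
        if elem.tag == "car":
            records.append([elem.findtext(f, default="") for f in FIELDS])
            elem.clear()  # release the children of the processed element
    return records


def load_cached(xml_path):
    """Return the reduced records, regenerating the cache when it is stale."""
    cache_path = xml_path + ".cached"  # hypothetical naming scheme
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(xml_path)):
        with open(cache_path, encoding="utf-8") as fh:
            return json.load(fh)
    records = extract(xml_path)
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    return records


def changed_or_added(old_xml, new_xml):
    """Records present in new_xml but not in old_xml."""
    old = {tuple(r) for r in load_cached(old_xml)}
    return [r for r in load_cached(new_xml) if tuple(r) not in old]
```

Calling changed_or_added("A.xml", "B.xml") parses each XML file only when its cache is missing or older than the source. Later comparisons read the much smaller cache files.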

The format of ".cached" depends on what you care about and how much information or context you need. It could even be a binary representation.
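
One possible binary representation, shown only as a sketch: pickle a frozenset of tuples, so that loading the cache yields something that can be compared with set operators directly. The helper names save_records and load_records are hypothetical. pickle should only be used to load files you generated yourself.

```python
import pickle


def save_records(records, cache_path):
    """Store the reduced records as a pickled frozenset (binary file)."""
    with open(cache_path, "wb") as fh:
        pickle.dump(frozenset(records), fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_records(cache_path):
    """Load a frozenset previously written by save_records."""
    with open(cache_path, "rb") as fh:
        return pickle.load(fh)
```

With two such caches, load_records("B.cached") - load_records("A.cached") gives the entries that changed or were added, at the cost of losing the original order and any duplicate entries.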

Tags: python, xml

Norbert
