
Efficiently compare two XML files in Python

Tags: python, xml

I'm trying to find an efficient way to compare two XML files and handle the differences in a Python script. The scenario is that I have two XML files similar to the following one:

<?xml version="1.0" encoding="UTF-8"?> 
<garage> 
    <car> 
        <color>red</color> 
        <size>big</size> 
        <price>10000</price>
    </car> 
    <car> 
        <color>blue</color> 
        <size>big</size> 
        <price>10000</price>
    </car>

    <!-- [...] -->

    <car>
        <color>red</color>
        <size>big</size>
        <price>11000</price>
    </car>
</garage>

Those XML files contain thousands of small objects. The files themselves are about 5 MB in size. The tricky thing is that only very few entries differ between the two files, and I only need to handle the information that differs. In other words: I need to efficiently (!) find out which entries have changed or been added. Unfortunately, the XML files also contain some optional entries that I don't care about at all.

I considered the following solutions:

  1. Parse both files into a DOM tree and compare them in a loop
  2. Parse both files into sets and use operators like set.difference
  3. Try to hand some of the processing over to Linux tools like grep and diff
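
For illustration, option 2 could look roughly like the sketch below. It assumes that a car is fully identified by its color, size and price, and that every other child element is one of the optional entries to ignore; the names FIELDS, car_set and changed_or_added are made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: only these child elements matter; optional entries are skipped.
FIELDS = ("color", "size", "price")


def car_set(xml_text):
    """Return the set of (color, size, price) tuples found in the document."""
    root = ET.fromstring(xml_text)
    return {
        # A missing field is treated as an empty string.
        tuple((car.findtext(field) or "").strip() for field in FIELDS)
        for car in root.iter("car")
    }


def changed_or_added(old_xml, new_xml):
    """Entries present in the new document but not in the old one."""
    return car_set(new_xml) - car_set(old_xml)
```

A set collapses identical entries, so if two cars with the same values must be counted separately, something like collections.Counter would be needed instead. A changed entry also shows up here simply as a new tuple, because nothing in the sample data identifies a car across both files.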

Does anybody here have experience with the performance of such approaches and can point me in the right direction?

Norbert asked Nov 11 '22 20:11



1 Answer

Create a cached intermediate format that contains only the data you care about comparing. When comparing two files, A.xml and B.xml, compare A.cached and B.cached instead, generating them if they are missing and removing them when the source file changes (or regenerating them based on a timestamp, etc.). The generation cost is amortized over multiple comparisons, and you avoid iterating over unnecessary entries.

The format of ".cached" really depends on what you care about and how much information/context you need. It could even be a binary representation.
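
A minimal sketch of this idea, assuming the relevant data per car is its color, size and price, and using a pickled set as one possible binary ".cached" format. The helper names load_entries and diff_files and the FIELDS tuple are hypothetical, and the freshness check is a plain modification-time comparison:

```python
import os
import pickle
import xml.etree.ElementTree as ET

# Assumption: only these child elements are worth comparing.
FIELDS = ("color", "size", "price")


def load_entries(xml_path):
    """Return the relevant entries of xml_path, reusing a fresh '.cached' file."""
    cache_path = os.path.splitext(xml_path)[0] + ".cached"
    # Reuse the cache only if it exists and is at least as new as the XML file.
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(xml_path)):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)

    entries = set()
    # iterparse reads the file incrementally; clearing each finished <car>
    # drops its children so memory use stays low.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "car":
            entries.add(tuple((elem.findtext(f) or "").strip() for f in FIELDS))
            elem.clear()

    with open(cache_path, "wb") as fh:
        pickle.dump(entries, fh)
    return entries


def diff_files(old_path, new_path):
    """Entries that are in new_path but not in old_path."""
    return load_entries(new_path) - load_entries(old_path)
```

Calling diff_files("A.xml", "B.xml") parses each file at most once; later comparisons read A.cached and B.cached directly until the corresponding XML file is modified. Pickle files should only be loaded if you wrote them yourself, so a text format such as JSON may be preferable when the cache is shared.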

Preet Kukreti answered Nov 15 '22 01:11
