I'm trying to find an efficient approach to compare two XML files and handle the differences in a Python script. The scenario is that I have two XML files similar to the following one:
<?xml version="1.0" encoding="UTF-8"?>
<garage>
<car>
<color>red</color>
<size>big</size>
<price>10000</price>
</car>
<car>
<color>blue</color>
<size>big</size>
<price>10000</price>
</car>
<!-- [...] -->
<car>
<color>red</color>
<size>big</size>
<price>11000</price>
</car>
</garage>
Those XML files contain thousands of small objects. The files themselves are about 5 MB each. The tricky thing is that only very few entries differ between the two files, and I only need to handle the information that differs. In other words: I need to efficiently (!) find out which entries have changed or been added. Unfortunately, the XML files also contain some optional entries that I don't care about at all.
I considered the following solutions:
Does anybody here have experience with the performance of such approaches and can point me in the right direction?
Create a cached intermediate format that only has the stuff you care about comparing. When comparing two files, A.xml and B.xml, compare their A.cached and B.cached instead, generating them if missing and removing them on file change (or regenerating them based on timestamp, etc.). The generation cost will be amortized over multiple comparisons, and you will not be iterating over unnecessary entries.
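As a rough sketch of that idea, something like the following could work. It assumes the entries are <car> elements and that color, size and price are the only fields worth comparing (both taken from the example in the question). The helper names, the JSON cache layout and the modification-time check are illustrative choices, not the only way to do it.

```python
import json
import os
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: only these child elements matter; optional entries are ignored.
FIELDS = ("color", "size", "price")


def build_cache(xml_path, cache_path):
    """Stream the XML once and store the relevant fields of every <car>."""
    entries = []
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "car":
            entries.append([elem.findtext(f, default="") for f in FIELDS])
            elem.clear()  # drop children of entries that are already processed
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(entries, fh)
    return entries


def load_cache(xml_path):
    """Return cached entries, regenerating them if missing or out of date."""
    cache_path = os.path.splitext(xml_path)[0] + ".cached"
    if (not os.path.exists(cache_path)
            or os.path.getmtime(cache_path) < os.path.getmtime(xml_path)):
        return build_cache(xml_path, cache_path)
    with open(cache_path, encoding="utf-8") as fh:
        return json.load(fh)


def changed_or_added(old_xml, new_xml):
    """Entries that occur in new_xml more often than in old_xml."""
    old = Counter(tuple(e) for e in load_cache(old_xml))
    new = Counter(tuple(e) for e in load_cache(new_xml))
    # Counter subtraction keeps only positive counts, i.e. the surplus in new.
    return sorted((new - old).elements())
```

Calling changed_or_added("A.xml", "B.xml") a second time only reads the two .cached files, so the XML parsing cost is paid once per file version. A changed entry shows up as "added" here, because the example entries have no ID that would link the old and new version of the same car.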
The format of ".cached" really depends on what you care about and how much information/context you need. It could even have a binary representation.
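To illustrate what a binary representation might look like, here is a minimal sketch that stores one fixed-size hash per entry, in document order. The field list, the 16-byte BLAKE2b digest and the function names are assumptions made for the example. It only tells you which entries of the new file are unknown to the old one, so it fits best when you can look the entry up in the XML afterwards.

```python
import hashlib
import xml.etree.ElementTree as ET

FIELDS = ("color", "size", "price")  # assumed relevant fields
DIGEST_SIZE = 16  # bytes stored per entry


def entry_digest(elem):
    """Hash only the relevant fields of one <car> element."""
    payload = "\x1f".join(elem.findtext(f, default="") for f in FIELDS)
    return hashlib.blake2b(payload.encode("utf-8"),
                           digest_size=DIGEST_SIZE).digest()


def pack_digests(xml_text):
    """Binary cache: one fixed-size digest per <car>, in document order."""
    root = ET.fromstring(xml_text)
    return b"".join(entry_digest(car) for car in root.iter("car"))


def unpack_digests(blob):
    """Split the binary cache back into its per-entry digests."""
    return [blob[i:i + DIGEST_SIZE] for i in range(0, len(blob), DIGEST_SIZE)]


def new_entry_indexes(old_blob, new_blob):
    """Positions of <car> entries in the new file that the old file lacks."""
    old = set(unpack_digests(old_blob))
    return [i for i, d in enumerate(unpack_digests(new_blob)) if d not in old]
```

The returned indexes refer to the n-th <car> in the new file. Because the old digests are kept in a set, an additional copy of an entry that already exists in the old file is not reported.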