I've got two ways of fetching a bunch of data. The data is stored in a sorted vector<map<string, int> >
.
I want to identify whether there are inconsistencies between the two vectors.
What I'm currently doing (pseudo-code):
for i in 0... min(length(vector1), length(vector2)):
for (k, v) in vector1[i]:
if v != vector2[i][k]:
// report that k is bad for index i,
// with vector1 having v, vector2 having vector2[i][k]
for i in 0... min(length(vector1), length(vector2)):
for (k, v) in vector2[i]:
if v != vector1[i][k]:
// report that k is bad for index i,
// with vector2 having v, vector1 having vector1[i][k]
This works in general, but breaks horribly if vector1
has a, b, c, d
and vector2
has a, b, b1, c, d
(it reports brokenness for b1
, c
, and d
). I'm after an algorithm that tells me that there's an extra entry in vector2
compared to vector1
.
I think I want to do something where when I encountered mismatches entries, I look at the next entries in the second vector, and if a match is found before the end of the second vector, store the index i
of the entry found in the second vector, and move to matching the next entry in the first vector, beginning with vector2[i+1]
.
Is there a neater way of doing this? Some standard algorithm that I've not come across?
I'm working in C++, so C++ solutions are welcome, but solutions in any language or pseudo-code would also be great.
Given the arbitrary map objects: a
, b
, c
, d
, e
, f
and g
;
With vector1
: a
, b
, d
, e
, f
and vector2
: a
, c
, e
, f
I want an algorithm that tells me either:
Extra
b
at index 1 ofvector1
, andvector2's c != vector1's d
.
or (I'd view this as an effectively equivalent outcome)
vector1's b != vector2's c
and extrad
at index 2 ofvector1
I ended up using std::set_difference
, and then doing some matching on the diffs from both sets to work out which entries were similar but different, and which had entries completely absent from the other vector.
Something like the std::mismatch algorithm
You could also use std::set_difference
It sounds like you're looking for the diff algorithm. The idea is to identify the longest common subsequence of the two vectors (using map equality), then recurse down the non-common portions. Eventually you'll have an alternating list of vector sub-sequences that are identical, and sub-sequences that have no common elements. You can then easily produce whatever output you like from this.
Apply it to the two vectors, and there you go.
Note that since map comparison is expensive, if you can hash the maps (use a strong hash - collisions will result in incorrect output) and use the hashes for comparisons you'll save a lot of time.
Once you're down to the mismatched subsequences at the end, you'll have something like:
Input vectors: a b c d e f, a b c' d e f
Output:
COMMON a b
LEFT c
RIGHT c'
COMMON d e f
You can then individually compare the maps c
and c'
to figure out how they differ.
If you have a mutation and insertion next to each other, it gets more complex:
Input vectors: a b V W d e f, a b X Y d e f
Output:
COMMON a b
LEFT V W
RIGHT X Y
COMMON d e f
Determining whether to match V
and W
against X
or Y
(or not at all) is something you'll have to come up with a heuristic for.
Of course, if you don't care about how the content of the maps differ, then you can stop here, and you have the output you want.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With