 

Finding items to de-duplicate

I have a pool of data (X1..XN), for which I want to find groups of equal values. Comparison is very expensive, and I can't keep all data in memory.

The result I need is, for example:

X1 equals X3 and X6
X2 is unique
X4 equals X5

(Order of the lines, or order within a line, doesn't matter).

How can I implement that with pair-wise comparisons?


Here's what I have so far:

Compare all pairs (Xi, Xk) with i < k, and exploit transitivity: if I already found X1==X3 and X1==X6, I don't need to compare X3 and X6.

So I could use the following data structure (a code sketch follows the rules below):

  map: index --> group
  multimap: group --> indices

where group is arbitrarily assigned (e.g. "line number" in the output).

For a pair (Xi, Xk) with i < k:

  • if both i and k already have a group assigned, skip

  • if they compare equal:

    • if i already has a group assigned, put k in that group
    • otherwise, create a new group for i and put k in it
  • if they are not equal:

    • if i has no group assigned yet, assign a new group for i
    • same for k
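
For concreteness, a minimal C++ transcription of these rules might look like this (a sketch only; is_equal stands in for the expensive full comparison and is a hypothetical name):

  #include <cstddef>
  #include <map>

  using Index = std::size_t;
  using Group = std::size_t;

  struct Grouper {
      std::map<Index, Group> group_of;      // map: index --> group
      std::multimap<Group, Index> members;  // multimap: group --> indices
      Group next_group = 0;                 // next unused group id

      void assign(Index i, Group g) {
          group_of[i] = g;
          members.emplace(g, i);
      }

      // Process one pair (i, k) with i < k; pairs must arrive in a
      // suitable order, as noted below.
      template <class Equal>  // Equal: callable as bool(Index, Index)
      void process(Index i, Index k, Equal is_equal) {
          const bool i_has = group_of.count(i) != 0;
          const bool k_has = group_of.count(k) != 0;
          if (i_has && k_has) return;               // both classified: skip
          if (is_equal(i, k)) {
              if (!i_has) assign(i, next_group++);  // new group for i
              assign(k, group_of[i]);               // put k in i's group
          } else {
              if (!i_has) assign(i, next_group++);
              if (!k_has) assign(k, next_group++);
          }
      }
  };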

That should work if I'm careful with the order of items, but I wonder if this is the best / least surprising way to solve this, as this problem seems to be somewhat common.


Background/more info: the purpose is deduplicating storage of the items. They already have a hash; in case of a collision we want to guarantee a full comparison. The sizes of the items follow a very sharp long-tail distribution.

An iterative algorithm (find any two duplicates, share them, repeat until there are no duplicates left) might be easier, but we want non-modifying diagnostics. Code base is C++, something that works with STL / boost containers or algorithms would be nice.

[edit] Regarding the hash: For the purpose of this question, please assume a weak hash function that cannot be replaced.

This is required for a one-time deduplication of existing data, and it needs to deal with hash collisions. The original choice was "fast hash, and compare on collision"; the hash chosen turns out to be a little weak, but changing it would break backward compatibility. Even then, I sleep better with the simple statement "in case of a collision, you won't get the wrong data" instead of blogging about wolf attacks.

asked Oct 21 '22 by peterchen


2 Answers

Here's another, maybe simpler, data structure for exploiting transitivity. Make a queue of the comparisons you need to do. For example, in the case of 4 items, it will be [ (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) ]. Also keep an array of the comparisons you've already done. Before each comparison, check whether that comparison has been done before, and every time you find a match, go through the queue and replace the matching item's index with its lower-index equivalent.

For example, suppose we pop (1,2), compare, and find they're not equal; we push (1,2) to the already-visited array and continue. Next, pop (1,3) and find that they are equal. At this point, go through the queue and replace all 3's with 1's. The queue becomes [ (1,4), (2,1), (2,4), (1,4) ], and so on. When we reach (2,1), it has already been visited, so we skip it, and the same with the second (1,4).
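
In C++, this might be sketched as follows (expensive_equal is a hypothetical stand-in for the costly comparison; std::deque holds the pending queue and std::set the already-done pairs):

  #include <algorithm>
  #include <deque>
  #include <set>
  #include <utility>

  using Pair = std::pair<int, int>;

  template <class Equal>  // Equal: callable as bool(int, int)
  void find_duplicates(int n, Equal expensive_equal) {
      std::deque<Pair> queue;  // comparisons still to do
      for (int i = 1; i <= n; ++i)
          for (int k = i + 1; k <= n; ++k)
              queue.push_back({i, k});

      std::set<Pair> visited;  // comparisons already done
      while (!queue.empty()) {
          Pair p = queue.front();
          queue.pop_front();
          // Normalize so that e.g. (2,1) is recognized as the done (1,2).
          if (p.first > p.second) std::swap(p.first, p.second);
          if (p.first == p.second || !visited.insert(p).second)
              continue;        // self-pair or already visited: skip
          if (expensive_equal(p.first, p.second)) {
              // Match: replace the higher index with the lower one in
              // all pending comparisons.
              for (Pair& q : queue) {
                  if (q.first == p.second)  q.first = p.first;
                  if (q.second == p.second) q.second = p.first;
              }
          }
      }
  }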

But I do agree with the previous answers. Since comparisons are computationally expensive, you probably want to compute a fast, reliable hash for each item first, and only then apply this method to the collisions.

answered Oct 24 '22 by darksky


So... you already have a hash? How about this (sketched in code below):

  • sort and group on hash
  • print all groups with size 1 as unique
  • compare collisions
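
A sketch of the first two steps in C++ (type and field names are illustrative; assumes the hashes fit in memory, as stated below):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  struct Item {
      std::uint64_t hash;  // precomputed hash of the item
      std::size_t index;   // position in the original pool
  };

  // Sort by hash so equal hashes become adjacent; size-1 buckets are
  // provably unique, larger buckets still need the full comparison.
  void group_by_hash(std::vector<Item> items) {
      std::sort(items.begin(), items.end(),
                [](const Item& a, const Item& b) { return a.hash < b.hash; });

      for (auto lo = items.begin(); lo != items.end(); ) {
          auto hi = std::find_if(lo, items.end(),
              [&](const Item& x) { return x.hash != lo->hash; });
          if (hi - lo == 1) {
              std::printf("X%zu is unique\n", lo->index);
          } else {
              std::printf("hash collision, compare fully:");
              for (auto it = lo; it != hi; ++it)
                  std::printf(" X%zu", it->index);
              std::printf("\n");
          }
          lo = hi;
      }
  }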

Tip for comparing collisions: why not just rehash them with a different algorithm? Rinse, repeat.

(I am assuming you are storing files/blobs/images here, that you have hashes of them, and that you can slurp the hashes into memory; also, that the hashes are like sha1/md5 etc., so collisions are very unlikely.)

(Also, I'm assuming that two different hashing algorithms will not both collide on the same pair of different inputs, but this is probably safe to assume...)

answered Oct 24 '22 by Daren Thomas