 

Finding items to de-duplicate

I have a pool of data (X1..XN), for which I want to find groups of equal values. Comparison is very expensive, and I can't keep all data in memory.

The result I need is, for example:

X1 equals X3 and X6
X2 is unique
X4 equals X5

(Order of the lines, or order within a line, doesn't matter).

How can I implement that with pair-wise comparisons?


Here's what I have so far:

Compare all pairs (Xi, Xk) with i < k, and exploit transitivity: if I already found X1==X3 and X1==X6, I don't need to compare X3 and X6.

So I could use the following data structure (a code sketch follows the rules below):

  map: index --> group
  multimap: group --> indices

where group is arbitrarily assigned (e.g. "line number" in the output).

For a pair (Xi, Xk) with i < k:

  • if both i and k already have a group assigned, skip

  • if they compare equal:

    • if i already has a group assigned, put k in that group
    • otherwise, create a new group for i and put k in it
  • if they are not equal:

    • if i has no group assigned yet, assign a new group for i
    • same for k
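
For concreteness, a minimal C++ transcription of these rules might look like this (a sketch only; is_equal stands in for the expensive full comparison and is a hypothetical name):

  #include <cstddef>
  #include <map>

  using Index = std::size_t;
  using Group = std::size_t;

  struct Grouper {
      std::map<Index, Group> group_of;      // map: index --> group
      std::multimap<Group, Index> members;  // multimap: group --> indices
      Group next_group = 0;                 // next unused group id

      void assign(Index i, Group g) {
          group_of[i] = g;
          members.emplace(g, i);
      }

      // Process one pair (i, k) with i < k; pairs must arrive in a
      // suitable order, as noted below.
      template <class Equal>  // Equal: callable as bool(Index, Index)
      void process(Index i, Index k, Equal is_equal) {
          const bool i_has = group_of.count(i) != 0;
          const bool k_has = group_of.count(k) != 0;
          if (i_has && k_has) return;               // both classified: skip
          if (is_equal(i, k)) {
              if (!i_has) assign(i, next_group++);  // new group for i
              assign(k, group_of[i]);               // put k in i's group
          } else {
              if (!i_has) assign(i, next_group++);
              if (!k_has) assign(k, next_group++);
          }
      }
  };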

That should work if I'm careful with the order of items, but I wonder if this is the best / least surprising way to solve this, as this problem seems to be somewhat common.


Background/more info: the purpose is deduplicating storage of the items. They already have a hash; in case of a collision we want to guarantee a full comparison. The sizes of the items follow a very sharp long-tail distribution.

An iterative algorithm (find any two duplicates, share them, repeat until there are no duplicates left) might be easier, but we want non-modifying diagnostics. Code base is C++, something that works with STL / boost containers or algorithms would be nice.

[edit] Regarding the hash: For the purpose of this question, please assume a weak hash function that cannot be replaced.

This is required for a one-time deduplication of existing data, and it needs to deal with hash collisions. The original choice was "fast hash, and compare on collision"; the hash chosen turns out to be a little weak, but changing it would break backward compatibility. Even then, I sleep better with the simple statement "in case of a collision, you won't get the wrong data" instead of blogging about wolf attacks.

asked Oct 21 '22 by peterchen


2 Answers

Here's another, maybe simpler, data structure for exploiting transitivity. Make a queue of the comparisons you need to do. For example, in the case of 4 items, it will be [ (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) ]. Also keep an array of the comparisons you've already done. Before each comparison, check whether that comparison has been done before, and every time you find a match, go through the queue and replace the matching item's index with its lower-index equivalent.

For example, suppose we pop (1,2), compare, and find they're not equal; we push (1,2) to the already-visited array and continue. Next, pop (1,3) and find that they are equal. At this point, go through the queue and replace all 3's with 1's. The queue becomes [ (1,4), (2,1), (2,4), (1,4) ], and so on. When we reach (2,1), it has already been visited, so we skip it, and the same with the second (1,4).
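
In C++, this might be sketched as follows (expensive_equal is a hypothetical stand-in for the costly comparison; std::deque holds the pending queue and std::set the already-done pairs):

  #include <algorithm>
  #include <deque>
  #include <set>
  #include <utility>

  using Pair = std::pair<int, int>;

  template <class Equal>  // Equal: callable as bool(int, int)
  void find_duplicates(int n, Equal expensive_equal) {
      std::deque<Pair> queue;  // comparisons still to do
      for (int i = 1; i <= n; ++i)
          for (int k = i + 1; k <= n; ++k)
              queue.push_back({i, k});

      std::set<Pair> visited;  // comparisons already done
      while (!queue.empty()) {
          Pair p = queue.front();
          queue.pop_front();
          // Normalize so that e.g. (2,1) is recognized as the done (1,2).
          if (p.first > p.second) std::swap(p.first, p.second);
          if (p.first == p.second || !visited.insert(p).second)
              continue;        // self-pair or already visited: skip
          if (expensive_equal(p.first, p.second)) {
              // Match: replace the higher index with the lower one in
              // all pending comparisons.
              for (Pair& q : queue) {
                  if (q.first == p.second)  q.first = p.first;
                  if (q.second == p.second) q.second = p.first;
              }
          }
      }
  }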

But I do agree with the previous answers. Since comparisons are computationally expensive, you probably want to compute a fast, reliable hash for each item first, and only then apply this method to the collisions.

answered Oct 24 '22 by darksky


So... you already have a hash? How about this (sketched in code below):

  • sort and group on hash
  • print all groups with size 1 as unique
  • compare collisions
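
A sketch of the first two steps in C++ (type and field names are illustrative; assumes the hashes fit in memory, as stated below):

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <cstdio>
  #include <vector>

  struct Item {
      std::uint64_t hash;  // precomputed hash of the item
      std::size_t index;   // position in the original pool
  };

  // Sort by hash so equal hashes become adjacent; size-1 buckets are
  // provably unique, larger buckets still need the full comparison.
  void group_by_hash(std::vector<Item> items) {
      std::sort(items.begin(), items.end(),
                [](const Item& a, const Item& b) { return a.hash < b.hash; });

      for (auto lo = items.begin(); lo != items.end(); ) {
          auto hi = std::find_if(lo, items.end(),
              [&](const Item& x) { return x.hash != lo->hash; });
          if (hi - lo == 1) {
              std::printf("X%zu is unique\n", lo->index);
          } else {
              std::printf("hash collision, compare fully:");
              for (auto it = lo; it != hi; ++it)
                  std::printf(" X%zu", it->index);
              std::printf("\n");
          }
          lo = hi;
      }
  }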

Tip for comparing collisions: why not just rehash them with a different algorithm? Rinse, repeat.

(I am assuming you are storing files/blobs/images here, that you have hashes of them, and that you can slurp the hashes into memory; also, that the hashes are like sha1/md5 etc., so collisions are very unlikely.)

(Also, I'm assuming that two different hashing algorithms will not both collide on the same pair of different inputs, but this is probably safe to assume...)

answered Oct 24 '22 by Daren Thomas