Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest Set Operations In The West

I haven't been able to find any satisfactory coverage of this topic all in one place, so I was wondering: What are the fastest set intersect, union, and disjoin algorithms?
Are there any interesting ones with limited domains?
Can anyone beat O(Z) where Z is the actual size of intersection?

If your approach relies on sorted sets, please note that, but don't consider it a disqualifying factor. It seems to me that there must be a veritable storehouse of subtle optimizations to be shared, and I don't want to miss any of them.

A few algorithms I know rely on bitwise operations beyond the vanilla, so you may assume the presence of SSE4 and access to intrinsics like popcount. Please note this assumption.

Of interest: An Implementation of B-Y Intersect

Update
We've got some really good partial answers, but I'm still hoping for some more complete attacks on the problem. I'm particularly interested in seeing a more fully articulated use of bloom filters in attacking the problem.

Update
I've done some preliminary work on combining bloom filters with a cuckoo hash table. It's looking almost obnoxiously promising, because they have very similar demands. I've gone ahead and accepted an answer, but I'm not really satisfied at the moment.

like image 704
Jake Kurzer Avatar asked Nov 23 '10 22:11

Jake Kurzer


5 Answers

If you're willing to consider set-like structures then bloom filters have trivial union and intersect operations.

like image 86
Thomas M. DuBuisson Avatar answered Oct 16 '22 21:10

Thomas M. DuBuisson


For reasonably dense sets, interval lists can beat O(n) for the operations you specify, where n is the number of elements in the set.

An interval list is essentially a strictly increasing list of numbers, [a1, b1, a2, b2, ..., an, bn], where each pair ai, bi denotes the interval [ai, bi). The strictly increasing constraint ensures that every describable set has a unique representation. Representing a set as an ordered collection of intervals allows your set operations to deal with multiple consecutive elements on each iteration.

like image 43
Rafe Avatar answered Oct 16 '22 21:10

Rafe


If set is actually a hashed set and both sets have the same hash function and table size then we can skip all buckets that exist only in one set. That could narrow search a bit.

like image 21
Dialecticus Avatar answered Oct 16 '22 20:10

Dialecticus


The following paper presents algorithms for union, intersection and difference on ordered sets that beat O(Z) if the intersection is larger than the difference (Z > n/2):

Confluently Persistent Sets and Maps

like image 23
smossen Avatar answered Oct 16 '22 21:10

smossen


there is no optimal solution than O(Z) because if you think of the problem logically each of the intersect, union and disjoin algorithms must at least read all of the input elements once, so Z reads is a must. also since the set is not sorted by default, no further optimizations could beat O(Z)

like image 23
Ali Tarhini Avatar answered Oct 16 '22 22:10

Ali Tarhini