Fastest Set Operations In The West

Question

I haven't been able to find any satisfactory coverage of this topic all in one place, so I was wondering: What are the fastest set intersect, union, and disjoin algorithms?
Are there any interesting ones with limited domains?
Can anyone beat O(Z) where Z is the actual size of intersection?

If your approach relies on sorted sets, please note that, but don't consider it a disqualifying factor. It seems to me that there must be a veritable storehouse of subtle optimizations to be shared, and I don't want to miss any of them.

A few algorithms I know rely on bitwise operations beyond the vanilla, so you may assume the presence of SSE4 and access to intrinsics like popcount. Please note this assumption.

Of interest: An Implementation of B-Y Intersect

Update
We've got some really good partial answers, but I'm still hoping for some more complete attacks on the problem. I'm particularly interested in seeing a more fully articulated use of bloom filters in attacking the problem.

Update
I've done some preliminary work on combining bloom filters with a cuckoo hash table. It's looking almost obnoxiously promising, because they have very similar demands. I've gone ahead and accepted an answer, but I'm not really satisfied at the moment.

Thomas M. DuBuisson · Accepted Answer

If you're willing to consider set-like structures then bloom filters have trivial union and intersect operations.

Rafe · Answer

For reasonably dense sets, interval lists can beat O(n) for the operations you specify, where n is the number of elements in the set.

An interval list is essentially a strictly increasing list of numbers, [a1, b1, a2, b2, ..., an, bn], where each pair ai, bi denotes the interval [ai, bi). The strictly increasing constraint ensures that every describable set has a unique representation. Representing a set as an ordered collection of intervals allows your set operations to deal with multiple consecutive elements on each iteration.

Dialecticus · Answer

If set is actually a hashed set and both sets have the same hash function and table size then we can skip all buckets that exist only in one set. That could narrow search a bit.

smossen · Answer

The following paper presents algorithms for union, intersection and difference on ordered sets that beat O(Z) if the intersection is larger than the difference (Z > n/2):

Confluently Persistent Sets and Maps

Ali Tarhini · Answer

there is no optimal solution than O(Z) because if you think of the problem logically each of the intersect, union and disjoin algorithms must at least read all of the input elements once, so Z reads is a must. also since the set is not sorted by default, no further optimizations could beat O(Z)

Fastest Set Operations In The West

Tags:

language-agnostic

algorithm

set-intersection

Jake Kurzer

5 Answers

Thomas M. DuBuisson

Rafe

Dialecticus

smossen

Ali Tarhini

Recent Activity

Donate For Us

Fastest Set Operations In The West

Tags:

language-agnostic

algorithm

set-intersection

Jake Kurzer

5 Answers

Thomas M. DuBuisson

Rafe

Dialecticus

smossen

Ali Tarhini

Related questions

Recent Activity

Donate For Us