I haven't been able to find any satisfactory coverage of this topic all in one place, so I was wondering:
What are the fastest set intersect, union, and disjoin algorithms?
Are there any interesting ones with limited domains?
Can anyone beat O(Z) where Z is the actual size of intersection?
If your approach relies on sorted sets, please note that, but don't consider it a disqualifying factor. It seems to me that there must be a veritable storehouse of subtle optimizations to be shared, and I don't want to miss any of them.
A few algorithms I know rely on bitwise operations beyond the vanilla, so you may assume the presence of SSE4 and access to intrinsics like popcount. Please note this assumption.
Of interest: An Implementation of B-Y Intersect
Update
We've got some really good partial answers, but I'm still hoping for some more complete attacks on the problem. I'm particularly interested in seeing a more fully articulated use of bloom filters in attacking the problem.
Update
I've done some preliminary work on combining bloom filters with a cuckoo hash table. It's looking almost obnoxiously promising, because they have very similar demands. I've gone ahead and accepted an answer, but I'm not really satisfied at the moment.
If you're willing to consider set-like structures then bloom filters have trivial union and intersect operations.
For reasonably dense sets, interval lists can beat O(n) for the operations you specify, where n is the number of elements in the set.
An interval list is essentially a strictly increasing list of numbers, [a1, b1, a2, b2, ..., an, bn], where each pair ai, bi denotes the interval [ai, bi). The strictly increasing constraint ensures that every describable set has a unique representation. Representing a set as an ordered collection of intervals allows your set operations to deal with multiple consecutive elements on each iteration.
If set is actually a hashed set and both sets have the same hash function and table size then we can skip all buckets that exist only in one set. That could narrow search a bit.
The following paper presents algorithms for union, intersection and difference on ordered sets that beat O(Z) if the intersection is larger than the difference (Z > n/2):
Confluently Persistent Sets and Maps
there is no optimal solution than O(Z) because if you think of the problem logically each of the intersect, union and disjoin algorithms must at least read all of the input elements once, so Z reads is a must. also since the set is not sorted by default, no further optimizations could beat O(Z)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With