Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm / Data structure for largest set intersection in a collection of sets with a given set

I have a large collection of several million sets, C. The elements of my sets come from a universe of about 2000 possible elements. I need to know, for a given set, s, which set in C has the largest intersection with s? (Or the k sets in C with the k-largest intersections). I will be making many of these queries, sequentially, for different s.

I know that the obvious way to do this is to just to loop over every set in C and compute the intersection and take the max. Are there any smart data structures / programming tricks that can speed up my search? It would be great if I could do this faster than O(C).

EDIT: approximate answers would be alright too

like image 919
newmanne Avatar asked Jul 30 '15 23:07

newmanne


People also ask

What is intersection in data structure?

A set intersection oracle (SIO) is a data structure which represents a collection of sets and can quickly answer queries about whether the set intersection of two given sets is non-empty. The input to the problem is n finite sets.

What is the time complexity of set intersection?

What's the time complexity of set intersection in Python? The time complexity of set intersection in Python on a set with n elements and a set argument with m elements is O(min(n, m)) because you need to check for the smaller set whether each of its elements is a member of the larger set.

How do you find the intersection of a set?

The intersection of two or more given sets is the set of elements that are common to each of the given sets. The intersection of sets is denoted by the symbol '∩'. In the case of independent events, we generally use the multiplication rule, P(A ∩ B) = P( A )P( B ).

What is a set algorithm?

Set algorithms are input-specialized algorithms that deal with sets. They implement basic mathematical set operations over sets with generic element types. STL implements set containers with red-black trees. The reason for this is that operations with sets require fast and bounded search on set members.


2 Answers

I don't think there's a clever data structure that will help with asymptotic performance. But this is a perfect map reduce problem. A GPGPU would do nicely. For a universe of 2048 elements, a set as a bitmap is only 256 bytes. 4 million is only a gigabyte. Even a modestly spec'ed Nvidia has that. E.g. programming in CUDA, you'd copy C to graphics card RAM, map a chunk of the gigabyte to each GPU core for searching and then reduce across cores to find the final answer. This ought to take on the order of a very few milliseconds. Not fast enough? Just buy hotter hardware.

If you re-phrase your question along these lines, you'll probably get answers from experts in this kind of programming, which I'm not.

like image 156
Gene Avatar answered Oct 06 '22 13:10

Gene


One simple trick is to sort the list of sets C in decreasing order by size, then proceed with brute force intersection tests as usual. As you go along, keep track of the set b with the biggest intersection so far. If you find a set whose intersection with the query set s has size |s| (or equivalently, has intersection equal to s -- use whichever of these tests is faster), you can immediately stop and return it as this is the best possible answer. Otherwise, if the next set from C has fewer than |b| elements, you can immediately stop and return b. This can easily be generalised to finding the top k matches.

like image 2
j_random_hacker Avatar answered Oct 06 '22 14:10

j_random_hacker