Algorithm / Data structure for largest set intersection in a collection of sets with a given set

I have a large collection, C, of several million sets. The elements of my sets come from a universe of about 2000 possible elements. For a given set s, I need to know which set in C has the largest intersection with s (or the k sets in C with the k largest intersections). I will be making many of these queries, sequentially, for different s.

I know that the obvious way to do this is just to loop over every set in C, compute the intersection, and take the max. Are there any smart data structures / programming tricks that can speed up my search? It would be great if I could do this faster than O(|C|).

EDIT: Approximate answers would be alright too.

asked Jul 30 '15 23:07

newmanne

2 Answers

I don't think there's a clever data structure that will help with asymptotic performance. But this is a perfect map-reduce problem, and a GPGPU would do nicely. For a universe of 2048 elements, a set stored as a bitmap is only 256 bytes, so 4 million sets take only about a gigabyte. Even a modestly spec'ed Nvidia card has that much memory. E.g. programming in CUDA, you'd copy C to graphics card RAM, map a chunk of the gigabyte to each GPU core for searching, and then reduce across cores to find the final answer. This ought to take on the order of a very few milliseconds. Not fast enough? Just buy hotter hardware.

If you re-phrase your question along these lines, you'll probably get answers from experts in this kind of programming, which I'm not.
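The bitmap idea in this answer can be tried on the CPU before reaching for a GPU. The following is only a minimal sketch, assuming elements are integers in the range 0..2047; the helper names `to_bitmap`, `popcount`, and `top_k_intersections` are made up for illustration. Each set becomes one Python integer with one bit per element, the intersection size is the number of set bits in the bitwise AND (the "map" step), and `heapq.nlargest` plays the role of the "reduce" step.

```python
import heapq

def to_bitmap(elements):
    """Pack a set of small non-negative integers into one int, one bit per element."""
    bitmap = 0
    for e in elements:
        bitmap |= 1 << e
    return bitmap

def popcount(x):
    """Count the set bits in x (portable; newer Pythons also offer int.bit_count)."""
    return bin(x).count("1")

def top_k_intersections(bitmaps, query, k=1):
    """Return the k largest (intersection_size, index) pairs, best first."""
    # Map: score every set against the query with AND + popcount.
    scored = ((popcount(b & query), i) for i, b in enumerate(bitmaps))
    # Reduce: keep only the k best scores.
    return heapq.nlargest(k, scored)

# Hypothetical toy data standing in for the real collection C.
C = [{1, 2, 3}, {2, 3, 4, 5}, {10, 11}, {3, 4, 5, 6, 7}]
bitmaps = [to_bitmap(c) for c in C]
query = to_bitmap({3, 4, 5, 10})

print(top_k_intersections(bitmaps, query, k=2))  # [(3, 3), (3, 1)]
```

This is still a linear scan over C, as the answer says; the gain is only in the constant factor, since each comparison is a fixed-width AND plus a bit count.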

answered Oct 06 '22 13:10

Gene

One simple trick is to sort the list of sets C in decreasing order by size, then proceed with brute force intersection tests as usual. As you go along, keep track of the set b with the biggest intersection so far. If you find a set whose intersection with the query set s has size |s| (or equivalently, has intersection equal to s; use whichever of these tests is faster), you can immediately stop and return it, as this is the best possible answer. Otherwise, if the next set from C has at most |b ∩ s| elements, you can immediately stop and return b: a set's intersection with s can never be larger than the set itself, and every later set in the sorted order is no bigger. This can easily be generalised to finding the top k matches.
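The early-exit scan described above could look roughly like this. It is a sketch under the assumption that the sets fit in memory as `frozenset` objects; `build_index` and `best_match` are hypothetical names, not part of any library.

```python
def build_index(sets):
    """Sort the collection once, largest set first, and reuse it for every query."""
    return sorted((frozenset(c) for c in sets), key=len, reverse=True)

def best_match(sorted_sets, s):
    """Return (set, intersection_size) for the set with the largest intersection with s."""
    s = frozenset(s)
    best, best_size = None, -1
    for c in sorted_sets:
        # Every remaining set has at most len(c) elements, so none can beat the best.
        if len(c) <= best_size:
            break
        size = len(c & s)
        if size > best_size:
            best, best_size = c, size
            # The intersection equals s: nothing can do better.
            if best_size == len(s):
                break
    return best, best_size

# Hypothetical toy data.
index = build_index([{1, 2, 3, 4, 5, 6}, {7, 8, 9}, {1, 2}, {9}])
print(best_match(index, {7, 8, 9, 1}))  # (frozenset({8, 9, 7}), 3), element order may vary
```

In this example the scan stops at the third set, because its size (2) cannot exceed the best intersection found so far (3). For the top k matches, the same test would compare against the k-th best intersection size seen so far instead of the single best. The worst case is still a full scan, so this helps most when queries overlap heavily with some large set.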

answered Oct 06 '22 14:10

j_random_hacker

Tags: algorithm, data-structures, intersection, set, set-intersection