Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structures for fast intersection operations?

Randomly select two sets ,both set contains distinct keys (one key may belongs to multiple sets ,one set can never contain duplicate keys ).

Return a integer which represents for the number of keys belongs to both sets .

For example intersect({1,2,3,4},{3,4,5}) returns 2 .

I just need the size of the intersection .I don't need to know exactly which keys are there in the intersection .

Are there any datastructures support this kind of operations in less than O(n) time ?

Edit:

Reading the data out does requires O(n) time, but that would not lead to the conclusion that you can't do the intersection operation in less than O(n) time .

Image this scenario:

I have N sets,each contains 100 keys . I read them up, that's N*100 operations .Now I want to know witch pair of sets have the largest intersection ,that's O(N²) intersection operations .So I want to reduce the complexity of the intersection operation .I don't really care how much time it takes to read and construct the sets ,it's at most N*100,that's nothing compares to O(N²) intersection operations .

Be aware ,there's no way that you can find the pair of sets that have the largest intersection by doing less than O(N²) intersection operations ,I can prove that .You must do all the intersection operations .

(he basic idea is ,let's imagine a complete graph ,with N vertices ,each represent for a set ,and Nx(N-1)/2 edges, each represents for a intersection for the connected pair .Now give each edge an non-negetive weight all you want(represents for the intersection size) ,I can always construct N sets satisfy those Nx(N-1)/2 edge weights .That proves my claim .)

like image 949
iouvxz Avatar asked Aug 25 '16 21:08

iouvxz


People also ask

What is intersection in data structure?

A set intersection oracle (SIO) is a data structure which represents a collection of sets and can quickly answer queries about whether the set intersection of two given sets is non-empty. The input to the problem is n finite sets.

What is the time complexity of set intersection?

What's the time complexity of set intersection in Python? The time complexity of set intersection in Python on a set with n elements and a set argument with m elements is O(min(n, m)) because you need to check for the smaller set whether each of its elements is a member of the larger set.


2 Answers

I would advice you to take a look at the two possible alternatives, which work particularly well in practice (especially in case of the large sets).

1. The Bloom Filter data structure

A Bloom filter is a space-efficient (based on the bit array) probabilistic data structure, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.

There is a trade-off between False Positive rate and the memory footprint of the Bloom Filter. Hence, it is possible to estimate the appropriate size of the Bloom Filter for different use cases.

Each set can be associated with its own Bloom Filter. It is very easy to obtain the Bloom Filter, which corresponds to the intersection of the different sets: all bit arrays, which correspond to the different Bloom Filters, can be combined using the bitwise AND operation.

Having the Bloom Filter, which corresponds to the intersection, it is possible to find quickly the items, which are present in all intersected sets.

Apart from that, it is possible to estimate the cardinality of the intersection without the actual iteration over the entire sets: https://en.wikipedia.org/wiki/Bloom_filter#The_union_and_intersection_of_sets

2. The Skip list data structure

A Skip List is a data structure that allows fast search and intersection within an ordered sequence of elements. Fast search and intersection are made possible by maintaining a linked hierarchy of subsequences, each skipping over fewer elements.

Succinctly saying, the Skip List is very similar to the plain Linked List data structure, however each node of the Skip List has a couple of additional pointers to items, which are located further (pointers, which "skips" over the couple of other nodes of the list).

Hence, in order to obtain the intersection - it is needed to maintain the pointers inside all Skip Lists, which are being intersected. During the intersection of the Skip Lists pointers are jumping over the items, which are not present in all intersected Skip Lists. Hence, usually the runtime complexity of the intersection operation is faster then O(N).

The algorithm of the intersection of Skip Lists is described in the book "Introduction to Information Retrieval" (written by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze): http://nlp.stanford.edu/IR-book/html/htmledition/faster-postings-list-intersection-via-skip-pointers-1.html

Skip Lists are actively used in a high-performance, full-featured text search engine library: Apache Lucene (Skip Lists are used inside the Inverted Index component).

Here is an additional Stackoverflow question about the usage of Skip Lists in Lucene: how lucene use skip list in inverted index?

like image 169
stemm Avatar answered Sep 22 '22 21:09

stemm


Let's assume there is an algorithm which allows checking the intersection length in less then O(n) time. Now let's read part of the input. We have two options: We've read whole set and part of another or we've read part of first set and part of the other.

Option 1):

counter-example - let's take such input that there exists an element which was read in set 1 and hasn't been read from set 2 but it is in set 2 - we'll receive incorrect result.

Option 2):

counter-example - we can have input such there exists element that is in two sets but hasn't been read in at least one. We receive incorrect result.

OK, we've proven that there is no such algorithm that returns the correct result when we don't read the whole input.

Let's read the whole input - n numbers. Oops, the complexity is O(n).

End of proof.

like image 24
xenteros Avatar answered Sep 24 '22 21:09

xenteros