Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to perform efficient sets intersection operation?

Tags:

c++

algorithm

I got elements called tokens.Each of them is associative container of some kind (not necessarily one of std containers).I got some type of container storage which stores tokens (not necessarily one of std containers).storages are sets of tokens.
I need ability to perform intersection operation on sets of tokens (storage) on token value by specified key with specified comparator. As result of that operations I want to get another set of tokens ( another storage ).

use case in pseudocode:

  if ( (storage0[key1]==storage1[key1])[key2]<storage1[key2] )
    ...

And I'm looking for a efficient algorithm of doing that.
note: I got dozens of tokens.
questions:
1) how to organize storage?
2) how to realize intersection operation?

update some more explanations :
token is a set of (key,value) pairs. storage is a set of sets of (key,value) pairs

I need intersect(P1,K1,P2,K2,cmp)
P1 - first storage
P2 - second storage
K1 - key for first storage
K2 - key for second storage
cmp - comparison function like cmp(value,value), which return true or false

Intersect should go through each element e1 of P1, and throught each element e2 of P2 and extract those elements ((key,value) pairs) which satisfy cmp(e1[K1],e2[K2])

like image 870
2r2w Avatar asked Nov 13 '22 10:11

2r2w


1 Answers

Inverted index handles intersection efficiently, so you can do something similar to it.

The idea is: each set will actually be a list, with an efficient getFirstAfter(key) function - which returns the first token after key. For each set - you need to check if the relevant element is in it - and if not - advance to the next element which is in the set.

(*) Note that tokens are enumerated in this algorithm

(*)Assume T is holding all the lists you want to intersect [the algorithm efficiently intersect more then two lists]

Psudo code:

lastTok <- 0 //the first token in the collection
currSet <- 0 //the first set 
while (lastTok != infinity):
  if (currSet > T.last): //if we have passed the last set
     insert lastTok into result
     currSet <- 0
     lastTok <- lastTok + 1
     continue
  currentToken<- T[currSet].getFirstAfter(lastTok-1)
  if (currentToken!= lastTok):
     lastTok<- currentToken
     currSet <- 0
  else: 
     currSet <- currSet + 1

This algorithm assumes efficient getFirstAfter() which can give you the first document which fits the term and his docId is greater then the specified parameter. It should return infinity if there is none.

The algorithm will be most efficient if the terms are sorted such that the rarest term is first.

The algorithm ensures at most #docs_matching_first_term * #terms iterations, but practically - it will usually be much less iterations.

More info can be found in this lecture notes slides 11-13 [copy rights in the lecture's first page]

like image 60
amit Avatar answered Jan 04 '23 22:01

amit