Efficiently find all sets S from a collection of sets C that are contained in a set D

I am starting with a collection C of target sets. Each set contains words (or strings without spaces). As I iterate over sentences, I can consider the sentence as an observed set of words D. My problem is that, for each sentence, I want to find all of the sets S in C such that D contains S. In other words, I want to find all sets of words in C for which all of their words are in the sentence.

For example, consider the following contents for C:

  {fell, ate}
  {cat, snail, tiger}
  {tree, bush, jalapeno}
  {pen, paperclip, stapler}

Now if I see the sentence "The tree fell over on the bush because it ate a jalapeno.", I would want to return the following two sets.

  {fell, ate}
  {tree, bush, jalapeno}

This seems somewhat similar to a trie, but only somewhat, because I am not matching on all the words in the sentence, only on an arbitrary subset of them. One idea was to represent the collection C in a trie-like data structure with a Map[String, PseudoTrie] at each level, and follow multiple paths through it as I iterate over the words of the sentence in sorted order.

I understand this is similar to a search query. However, it's fine if I need to iterate over all the sentences, as long as the computation for each sentence is fast. Iterating over every set in C and performing a containment check is too slow.
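
For concreteness, that too-slow baseline is roughly the following (a minimal sketch; the names are illustrative and not my actual code):

def naiveSearch(c: Seq[Set[String]], sentence: String): Seq[Set[String]] = {
  // Represent the sentence as its set of words.
  val d = sentence.split("\\s+").toSet
  // One containment check per target set, for every sentence.
  c.filter(_.subsetOf(d))
}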

UPDATE

I thought people might be interested in some of the results I had. These timings are over 100,000 sentences, and each target set S in C has about 4 or 5 words.

  1. Original application. Iterate over all target sets and see if they are contained in the sentence, where the sentence is represented by a set of words. 227 s

  2. Filtering by vocabulary. The same as the original process except sentences are represented by the set of words in that sentence that are also in some target set. The advantage is we only need to consider a subset of the words in a sentence when doing the comparisons. 110 s

  3. Strings are translated to an integer key starting from 0. This also includes the filtering in trial #2 by necessity. 86 s

  4. Add an inverted index (see the sketch after this list). 4 s
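
A minimal sketch of how trials #3 and #4 fit together (illustrative names, not my actual implementation): words are mapped to integer keys up front, and the inverted index maps each key to the target sets containing it, so only targets that share at least one word with the sentence are ever tested for containment.

// Target sets after the string-to-integer translation of trial #3.
type Target = Set[Int]

// Inverted index of trial #4: word key -> target sets containing that word.
def buildIndex(targets: Seq[Target]): Map[Int, Seq[Target]] =
  targets
    .flatMap(t => t.map(w => w -> t))
    .groupBy(_._1)
    .map { case (w, pairs) => w -> pairs.map(_._2) }

// Only candidates sharing a word with the sentence are checked with subsetOf.
def matches(sentence: Set[Int], index: Map[Int, Seq[Target]]): Seq[Target] =
  sentence.toSeq
    .flatMap(index.getOrElse(_, Seq.empty))
    .distinct
    .filter(_.subsetOf(sentence))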

I also tried BitSets with an inverted index. That took 20 s, so I didn't stick with them. I'm not sure why they were slower; I may have done something wrong.

Also, I think my original idea is great (many layers of inverted indices) but it's rather complicated, and I have a pretty good speedup already!

asked Oct 26 '11 by schmmd


2 Answers

We'll start with the corpus of sentences you want to search against:

val corpus = Seq(
  Set("fell", "ate"),
  Set("cat", "snail", "tiger"),
  Set("tree", "bush", "jalapeno"),
  Set("pen", "paperclip", "stapler")
)

One fairly efficient way of representing this is as a table of bits, with vocabulary types as the columns and sentences as the rows. We define a couple of functions for converting to this representation:

import scala.collection.immutable.BitSet

// Assign every distinct word in the corpus an integer index.
def vocabulary(c: Seq[Set[String]]) = c.reduce(_ union _).zipWithIndex.toMap

// Convert a set of words into a bit set, silently dropping out-of-vocabulary words.
def convert(s: Set[String], v: Map[String, Int]) = s.foldLeft(BitSet.empty) {
  (b, w) => v.get(w).map(b + _).getOrElse(b)
}

And a function to search a corpus c for all sentences that a given sentence s contains:

// Keep exactly the rows whose bits are all present in the sentence's bit set.
def search(s: BitSet, c: Seq[BitSet]) = c.filter(x => (x & s) == x)

This is going to be pretty damn fast, since it's just a bitwise "and" and an equality comparison for each sentence in the corpus. We can test:

val vocab = vocabulary(corpus)
val table = corpus.map(convert(_, vocab))

val sentence = convert(
  "The tree fell over on the bush because it ate a jalapeno".split(" ").toSet,
  vocab
)

val result = search(sentence, table)

Which gives us List(BitSet(2, 6), BitSet(5, 7, 10)). To confirm that this is what we wanted:

val bacov = vocab.map(_.swap).toMap
result.map(_.map(bacov(_)))

This is List(Set(fell, ate), Set(jalapeno, tree, bush)), as desired.

answered by Travis Brown


An inverted index can be quite useful for this sort of problem. As a starting point, consider creating a mapping from the words in C to a list of all the sets containing that word, thus having type Map[String, List[Set[String]]]; this is the inverted index.

Using the inverted index, you can find the sets which are contained in D, without ever checking those sets which have an empty intersection with D. Just iterate through the list of sets for each of the distinct words in D, keeping track of how many times each set is encountered. Compare the counts to the lengths of the sets; a set S is a subset of D if and only if the count for S equals the number of elements in S.
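
A minimal sketch of this counting scheme in Scala (the names here are illustrative, not from the question):

import scala.collection.mutable

// Each word maps to the list of target sets containing it.
def buildInvertedIndex(c: Seq[Set[String]]): Map[String, List[Set[String]]] =
  c.flatMap(s => s.map(w => w -> s))
    .groupBy(_._1)
    .map { case (w, pairs) => w -> pairs.map(_._2).toList }

// A set S is contained in D exactly when all |S| of its words are counted.
def containedIn(d: Set[String], index: Map[String, List[Set[String]]]): List[Set[String]] = {
  val counts = mutable.Map.empty[Set[String], Int].withDefaultValue(0)
  for (w <- d; s <- index.getOrElse(w, Nil)) counts(s) += 1
  counts.collect { case (s, n) if n == s.size => s }.toList
}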

This approach speeds up the search by eliminating checks of those sets which do not intersect D at all. You can extend the idea to eliminate even more checks by using an index from two-word sets to lists of sets containing both words. Now, sets which have only one word in common with D will not be checked (so sets with just one word need to be treated separately!). It becomes necessary to iterate through all two-element subsets of D, comparing the counts to the number of two-element subsets of each set S, but otherwise the procedure is the same.
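
Sketched in the same style (again with illustrative names), the two-word variant keys the index by unordered word pairs and compares each count against the number of word pairs in the set:

import scala.collection.mutable

// Index each unordered word pair of a target set to that set.
def buildPairIndex(c: Seq[Set[String]]): Map[(String, String), List[Set[String]]] =
  c.flatMap { s =>
    for (a <- s.toSeq; b <- s.toSeq if a < b) yield (a, b) -> s
  }.groupBy(_._1).map { case (k, ps) => k -> ps.map(_._2).toList }

// Sets with two or more words are contained in D exactly when all of their
// word pairs are counted; one-word sets never reach the pair index and must
// be handled separately, as noted above.
def containedByPairs(d: Set[String], index: Map[(String, String), List[Set[String]]]): List[Set[String]] = {
  val ws = d.toSeq.sorted
  val counts = mutable.Map.empty[Set[String], Int].withDefaultValue(0)
  for {
    i <- ws.indices
    j <- (i + 1) until ws.size
    s <- index.getOrElse((ws(i), ws(j)), Nil)
  } counts(s) += 1
  counts.collect { case (s, n) if n == s.size * (s.size - 1) / 2 => s }.toList
}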

Even larger subsets can be used as keys in the index, but at some point you'll be generating more possible keys than the number of operations that will be saved. The optimal choice will depend on the specifics of C and the set of sentences.

answered by Michael J. Barber