Data structure for Pattern Matching on large data

Problem Background

I have a finite vocabulary of, say, 10 symbols [A-J]. What these symbols mean is not relevant to the question. They could be DNA bases, phonemes, words, etc.

An item is a sequence of symbols. In this problem, all items are of the same length (say 6). E.g.

A C B A D J

I have a large (5M) table that contains counts of all items of length 6 sampled from some known data. E.g.

A C B A D J     55
B C B I C F     923
A B C D E G     478

Given a new sequence with one unknown symbol, my task is to guess the symbol. In the following example, the missing symbol is ?.

B C B ? C F

A simple solution to guess ? is to look in my table and find the item with the largest count that fits the pattern B C B ? C F. For instance, among the rows shown above, B C B I C F (count 923) matches the pattern, so the guess would be I.

Questions

  1. What is a good data structure to store my item-frequency table so that I can handle the space-time trade-off reasonably efficiently? I prefer to use less memory as long as the computation at query time is reasonable. (I will have many such tables, so the 5M figure is just an approximation.)

  2. What are some implementation details that can make a big difference in processing speed?

Things I have thought of:

  1. Make a string out of every sequence and use regexes to match. Caveats: (1) an O(n) scan over the whole table is unacceptable; (2) regexes are slow; (3) strings (in Java, at least) are memory-heavy.

  2. Let Lucene handle the indexing. Turn off tf-idf scoring. Use phrase search. Potentially use the count values for scoring so that Lucene takes care of the sorting too.

  3. Use prefix and suffix tries to index each item.

  4. Use a database (likely in-memory) with the data in one column, or split across separate columns, to handle the search.


Updates

  1. In my actual application, I will be working with sequences of lengths 5, 6, 7, 8, 9, and 10, stored separately. I simplified the problem by restricting it to a fixed length. Hence the constraint/preference for a solution that uses less memory.
  2. My vocabulary size can be assumed to be under 20.
asked May 09 '11 by hashable


2 Answers

The trie approach seems to be the best one: with the number of occurrences stored on the leaves, you can easily design a function that returns all possible strings with one missing character in O(log n) time, and then you just iterate over this small set of strings, looking for the maximum number of occurrences. If you use chars from A to Z, there will be at most 26 such strings, so iterating won't take long.
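To make this concrete, here is a minimal sketch of such a trie (the class and method names are my own, not taken from any library): counts live on the leaves, and a search with exactly one '?' branches to every child at the unknown position.

    import java.util.HashMap;
    import java.util.Map;

    class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        long count; // non-zero only on the leaf that ends a full item
    }

    class ItemTrie {
        private final TrieNode root = new TrieNode();

        void add(String item, long count) {
            TrieNode node = root;
            for (char c : item.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new TrieNode());
            }
            node.count = count;
        }

        // Collects every stored item matching a pattern with one '?' wildcard.
        Map<String, Long> match(String pattern) {
            Map<String, Long> hits = new HashMap<>();
            collect(root, pattern, 0, new StringBuilder(), hits);
            return hits;
        }

        private void collect(TrieNode node, String pattern, int i,
                             StringBuilder prefix, Map<String, Long> hits) {
            if (i == pattern.length()) {
                if (node.count > 0) hits.put(prefix.toString(), node.count);
                return;
            }
            char c = pattern.charAt(i);
            for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
                if (c != '?' && e.getKey() != c) continue; // concrete symbol: follow one edge
                prefix.append(e.getKey());
                collect(e.getValue(), pattern, i + 1, prefix, hits);
                prefix.deleteCharAt(prefix.length() - 1);
            }
        }
    }

match("BCB?CF") returns at most one candidate per vocabulary symbol; the guess is simply the key with the largest count.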

AFAIK, Lucene uses such a mechanism internally for its wildcard search, so you can concatenate your chars, index them with KeywordAnalyzer (to omit stemming) and then search for "ACB?DJ". The only restriction here is that Lucene cannot handle searches with a leading "?", but you can bypass that by adding one extra char at the beginning (just a trick to get past Lucene's checks) or by keeping one more index of reversed words (which will greatly increase performance for words with a wildcard as the first char).
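A rough sketch of that setup, assuming a recent Lucene (8.x); the field names "seq" and "count" and the in-memory directory are illustrative choices, not anything prescribed by Lucene:

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()));

            Document doc = new Document();
            // StringField indexes the concatenated item as a single untokenized term.
            doc.add(new StringField("seq", "BCBICF", Field.Store.YES));
            doc.add(new StoredField("count", 923));
            writer.addDocument(doc);
            writer.close();

            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            // In a WildcardQuery, '?' matches exactly one character.
            ScoreDoc[] hits = searcher.search(new WildcardQuery(new Term("seq", "BCB?CF")), 20).scoreDocs;
            for (ScoreDoc hit : hits) {
                Document d = searcher.doc(hit.doc);
                System.out.println(d.get("seq") + " " + d.getField("count").numericValue());
            }
        }
    }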

And, finally, since you have to calculate the number of occurrences first anyway, you can use a machine learning scheme such as decision trees to handle all the work. There have been cases where decision trees were used to compress a database and speed up search, so you can do the same. Use the lines as instances, the positions of the chars as attributes, and the chars themselves as attribute values. Then run an algorithm like C4.5 (you can use Weka's implementation, called J48) with minimal pruning and run classification; the algorithm will do the rest!
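If you want to try the decision-tree route, a minimal Weka sketch might look like the following. The file name items.arff is hypothetical; it would declare one nominal attribute per symbol position. In practice you would need one classifier per possible missing position.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class J48Sketch {
        public static void main(String[] args) throws Exception {
            // items.arff (hypothetical): one nominal attribute per symbol position.
            Instances data = new Instances(new BufferedReader(new FileReader("items.arff")));
            data.setClassIndex(3); // predict position 4, i.e. the '?' in "B C B ? C F"

            J48 tree = new J48();
            tree.setUnpruned(true); // minimal pruning, as suggested above
            tree.buildClassifier(data);

            // classifyInstance returns the index of the predicted nominal value.
            double predicted = tree.classifyInstance(data.instance(0));
            System.out.println(data.classAttribute().value((int) predicted));
        }
    }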

answered Sep 19 '22 by ffriend


Based on the comment that there will only be one unknown, you can do the following:

Put your data in a hash table. When you need to look up a pattern, generate all the wildcard combinations; since your vocabulary is limited, this means at most 20 lookups. This sounds like a hack, but if you consider the performance implications of the other methods, it is hard to beat. A hash table lookup is O(1), so 20 lookups are O(1) too.
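A minimal sketch of this approach (the class and method names are mine, purely for illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class WildcardLookup {
        private final Map<String, Long> counts = new HashMap<>();
        private final String vocabulary; // e.g. "ABCDEFGHIJ"

        WildcardLookup(String vocabulary) { this.vocabulary = vocabulary; }

        void add(String item, long count) { counts.put(item, count); }

        // Returns the best completion of a pattern with exactly one '?', or null.
        String guess(String pattern) {
            int pos = pattern.indexOf('?');
            String best = null;
            long bestCount = -1;
            for (char c : vocabulary.toCharArray()) {
                String candidate = pattern.substring(0, pos) + c + pattern.substring(pos + 1);
                Long n = counts.get(candidate); // O(1) expected lookup
                if (n != null && n > bestCount) { best = candidate; bestCount = n; }
            }
            return best;
        }
    }

For the pattern B C B ? C F this tries BCBACF, BCBBCF, and so on, one candidate per vocabulary symbol, and keeps the stored item with the highest count.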

This method is not advisable if the number of wildcards could increase, although it may still perform well for 2 or 3.

A double-array trie would also work and may reduce the amount of space needed to store your strings, but the performance would suffer.

answered Sep 18 '22 by idz