Problem Background
I have a finite vocabulary containing, say, 10 symbols [A-J]. What these symbols mean is not relevant to the question. They could be DNA bases, phonemes, words etc.
An item is a sequence of symbols. In this problem, all items are of the same length (say 6). E.g.
A C B A D J
I have a large (5M) table that contains counts of all items of length 6 sampled from some known data. E.g.
A C B A D J 55
B C B I C F 923
A B C D E G 478
Given a new sequence with one unknown symbol, my task is to guess the symbol. In the following example, the missing symbol is ?.
B C B ? C F
A simple solution to guess ? is to look in my table and find the item with the largest count that fits the pattern B C B ? C F.
Questions
What is a good data structure to store my item-frequency table so that I handle the space-time trade-off reasonably well? I prefer to use less memory if the computation at query time stays reasonable. (I will have many such tables, so the 5M figure is just an approximation.)
What are some implementation details that can make a big difference in processing speed?
Things I have thought of:
Make a string out of every sequence and use regexes to match. Caveats: (1) an O(n) scan over the whole table is unacceptable; (2) regexes are slow; (3) strings (in Java at least) are memory-heavy.
Let Lucene handle the indexing. Turn off tf-idf scoring. Use phrase search. Potentially use the count values for scoring so that Lucene takes care of the sorting too.
Use prefix and suffix tries to index each item.
Use a database (likely in-memory) with the data in one column (or in separate columns) and let it handle the search.
Updates
The trie approach seems to be the best one: with the occurrence count stored at each leaf, you can easily write a function that returns all strings matching a pattern with one missing character, in time proportional to the sequence length rather than the table size. The search only branches at the wildcard position, so if you use chars from A to Z there are at most 26 such strings, and iterating over that small set to find the maximum count takes almost no time.
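A minimal sketch of such a trie in Java (the class names and the toy data are illustrative, taken from the example above):

```java
import java.util.HashMap;
import java.util.Map;

// Fixed-length sequence trie: counts live at the leaves, and a '?'
// in the query branches over every child at that depth.
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    long count;                                   // meaningful only at leaf depth
}

public class SequenceTrie {
    private final TrieNode root = new TrieNode();

    public void add(String seq, long count) {
        TrieNode node = root;
        for (char c : seq.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.count = count;
    }

    /** Returns the symbol that maximises the count for a pattern with exactly one '?'. */
    public char bestFill(String pattern) {
        int wild = pattern.indexOf('?');
        TrieNode node = walk(root, pattern, 0, wild);    // descend to the wildcard position
        if (node == null) return 0;
        char best = 0;
        long bestCount = -1;
        for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
            TrieNode leaf = walk(e.getValue(), pattern, wild + 1, pattern.length());
            if (leaf != null && leaf.count > bestCount) {
                bestCount = leaf.count;
                best = e.getKey();
            }
        }
        return best;
    }

    private TrieNode walk(TrieNode node, String pattern, int from, int to) {
        for (int i = from; i < to && node != null; i++) {
            node = node.children.get(pattern.charAt(i));
        }
        return node;
    }

    public static void main(String[] args) {
        SequenceTrie trie = new SequenceTrie();
        trie.add("ACBADJ", 55);
        trie.add("BCBICF", 923);
        trie.add("ABCDEG", 478);
        System.out.println(trie.bestFill("BCB?CF"));     // prints I
    }
}
```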
AFAIK, Lucene uses such a mechanism internally for its wildcard search, so you can concatenate your chars, index them with KeywordAnalyzer (to omit stemming) and then search for "ACB?DJ". The only restriction here is that Lucene cannot handle searches with a leading "?", but you can bypass it by adding one extra char at the beginning (just a trick to get past Lucene's checks) or by keeping one more index of reversed words (which will greatly speed up queries whose wildcard is the first char).
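A rough sketch of that route (the field names "seq" and "count" are placeholders, and constructor details vary a bit between Lucene versions):

```java
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class WildcardLookup {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();                  // in-memory index
        IndexWriterConfig cfg = new IndexWriterConfig(new KeywordAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            addItem(writer, "ACBADJ", 55);
            addItem(writer, "BCBICF", 923);
            addItem(writer, "ABCDEG", 478);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // '?' matches exactly one character in a Lucene wildcard query
            Query q = new WildcardQuery(new Term("seq", "BCB?CF"));
            for (ScoreDoc sd : searcher.search(q, 10).scoreDocs) {
                Document d = searcher.doc(sd.doc);
                System.out.println(d.get("seq") + " -> " + d.get("count"));
            }
        }
    }

    private static void addItem(IndexWriter writer, String seq, long count) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("seq", seq, Field.Store.YES));       // exact, untokenized term
        doc.add(new StoredField("count", Long.toString(count)));     // stored for retrieval only
        writer.addDocument(doc);
    }
}
```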
And finally, if you have to calculate the number of occurrences anyway, you can use machine-learning schemes such as decision trees to handle all the work. There have been cases where decision trees were used to compress a database and speed up search, so you can do the same: use the lines as instances, the positions of the chars as attributes, and the chars themselves as attribute values. Then run an algorithm like C4.5 (you can use Weka's implementation, J48) with minimal pruning and run classification; the algorithm will do the rest.
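A sketch of that idea against a recent Weka API (an assumption on my part: one classifier per possible missing position would be needed in practice, and here position 3 is taken as the unknown, with the table counts used as instance weights):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import weka.classifiers.trees.J48;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

public class MissingSymbolTree {
    public static void main(String[] args) throws Exception {
        List<String> symbols = Arrays.asList("A","B","C","D","E","F","G","H","I","J");

        // One nominal attribute per position; position 3 (the unknown) is the class.
        ArrayList<Attribute> attrs = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            attrs.add(new Attribute("pos" + i, new ArrayList<>(symbols)));
        }
        Instances data = new Instances("items", attrs, 0);
        data.setClassIndex(3);

        // Each table row becomes one instance; its count becomes the instance weight.
        addRow(data, "ACBADJ", 55);
        addRow(data, "BCBICF", 923);
        addRow(data, "ABCDEG", 478);

        J48 tree = new J48();
        tree.setOptions(new String[] {"-U", "-M", "1"});   // unpruned, leaves may hold a single instance
        tree.buildClassifier(data);

        // Classify "BCB?CF": fill every known position and ask for the class value.
        Instance query = new DenseInstance(6);
        query.setDataset(data);
        String pattern = "BCB?CF";
        for (int i = 0; i < 6; i++) {
            if (pattern.charAt(i) != '?') {
                query.setValue(data.attribute(i), String.valueOf(pattern.charAt(i)));
            }
        }
        int predicted = (int) tree.classifyInstance(query);
        System.out.println(data.classAttribute().value(predicted));   // likely I on this toy data
    }

    private static void addRow(Instances data, String seq, double count) {
        Instance inst = new DenseInstance(6);
        inst.setDataset(data);
        for (int i = 0; i < 6; i++) {
            inst.setValue(data.attribute(i), String.valueOf(seq.charAt(i)));
        }
        inst.setWeight(count);
        data.add(inst);
    }
}
```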
Based on the comment that there will only be one unknown, you can do the following:
Put your data in a hashtable. When you need to look up a pattern, generate all the wildcard combinations; since your vocabulary is limited, this means at most one lookup per symbol (10 here). It sounds like a hack, but if you consider the performance implications of other methods it is hard to beat: the hash table lookup is O(1), and 10 lookups are O(1) too (see the sketch below).
This method is not advisable if the number of wildcards could increase, although it may still perform well for 2 or 3.
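A minimal sketch of that lookup, assuming the ten-symbol vocabulary from the question:

```java
import java.util.HashMap;
import java.util.Map;

public class PatternLookup {
    private static final char[] VOCAB = "ABCDEFGHIJ".toCharArray();
    private final Map<String, Long> counts = new HashMap<>();

    public void add(String seq, long count) {
        counts.put(seq, count);
    }

    /** Tries every symbol at the single '?' position: at most |vocabulary| O(1) lookups. */
    public char bestFill(String pattern) {
        int wild = pattern.indexOf('?');
        StringBuilder candidate = new StringBuilder(pattern);
        char best = 0;
        long bestCount = -1;
        for (char c : VOCAB) {
            candidate.setCharAt(wild, c);
            Long count = counts.get(candidate.toString());
            if (count != null && count > bestCount) {
                bestCount = count;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        PatternLookup table = new PatternLookup();
        table.add("ACBADJ", 55);
        table.add("BCBICF", 923);
        table.add("ABCDEG", 478);
        System.out.println(table.bestFill("BCB?CF"));   // prints I
    }
}
```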
A double-array trie would also work and may reduce the amount of space needed to store your strings, but the performance would suffer.