Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure for large number of patterns

In an interview, I was asked to come up with data structure that can hold millions of patterns and enables fast searching through them to find the longest matching one.

For instance, patterns are like:

1- 8876 8893 87          | true
2- 8876 889              | false
3- 8876 8                | false
4- 887                   | true

The input is a number with at least 2 and at most 18 digits and we need to find the longest matching pattern from the data structure and extract the boolean at the end.

For example, 8876 8893 9943 53 would match the 1 and true is returned. 8876 8397 5430 74 would match the 3 and false is returned.

My answer was to use a tree and having a list of key value pair at each level. Key being the digits and value is either null or equal to boolean depending on if it's the end of a pattern or not. Like:

# matching 8875
# start the search by first digit
[..., (7, null), (8, null), (9, null)] 
                  ^
                 [..., (7, null), (8, null), (9, null)] 
                                   ^
                                   [..., (7, true), (8, null), ...]
# at the last step because we don't have a pattern 
# to match the digit 5, we return the `true` from (7, true)

The challenging part is that, the patterns are quite a lot. Millions of them. Is this any good? If not, what is your suggestion.

like image 650
paytonpy Avatar asked Apr 21 '15 17:04

paytonpy


1 Answers

A very nice data structure that fits very well with the problem you described, i.e. a collection structure where many of the entries share a common prefix (and/or suffix), and where you perform searches based upon a shared prefix is a Trie.

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest. For the space-optimized presentation of prefix tree, see compact prefix tree.

Specifically the compact prefix tree or patricia trie seems to be well suited for your problem.

Given that the mentioned types of tries are often used to store values associated with keys, if that is not required for your problem (i.e. you are not required to store the original index of the input patterns string and return that on a search), there is a closely related solution that may fit even better. As noted by @JimMischel in the comments the Aho–Corasick string matching algorithm builds a trie like structure with additional links between the internal nodes. If the set of patterns to be matched is fixed, and the data structure built, then for a search its run time is linear in the length of the input plus the number of matched entries.

It is discussed in this SO question Aho Corasick algorithm

You can find some implementations of it online in for example C# or Java or Haskell.

like image 174
Alex Avatar answered Oct 10 '22 07:10

Alex