Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Aho-Corasick and Proper Substrings

I'm trying to understand the aho-corasick string match algorithm. Suppose that our patterns are abcd and bc. We end up a tree like this

       []
       /\
    [a]..[b]
    /  :  |
   [b].: [c]
    |     :
   [c].....
    |
   [d]

The dotted line show the failure function.

Now supposed that we feed in the string abcd. This will follow the tree and detect the match "abcd," however, as far as I can tell the match bc will not be reported. Am I misunderstanding the algorithm?

like image 486
Winston Ewert Avatar asked Mar 22 '11 16:03

Winston Ewert


People also ask

What is Aho Corasick used for?

In computer science, the Aho–Corasick algorithm is a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick in 1975. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text.

What is the complexity of the Aho Corasick algorithm?

The complexity of the Aho-Corasick algorithm is O(N + L + Z), where Z is the count of matches. This algorithm was invented by Alfred V. Aho and Margaret J. Corasick in 1975.


2 Answers

Artem's answer is correct, but perhaps not very clear. Basically what you need to do is: every time you arrive at a new node, examine the whole path starting from this node and consisting of failure links and find matches on this path. (This doesn't change your current position). In your example you would examine the paths b->b (no matches found) and c->c (the match bc is found).

Note that for efficiency reasons you need to cache the value of 'next match' at each node. That is, if you examine the path starting at node u and find a matching node v after several steps, remember the value next_match[u] = v so that next time when you have to examine this path, you could make a single jump straight to v.

like image 83
adamax Avatar answered Sep 28 '22 11:09

adamax


You have to mark nodes as really_final if there is a string ended in this node. In your example such nodes are "abcd" and "bc". After it you have to calc final states for nodes: node is final if it's really_final or node by failure function is final. So, "abcd", "bc" and "abc" will be final.

In other words - node is final if some pattern is ended here or some final node is reacheable from current node walking by failure links.

like image 44
Artem Volkhin Avatar answered Sep 28 '22 10:09

Artem Volkhin