Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Aho Corasick algorithm

I am not able to understand the below algorithm which is used for string pattern matching using Aho-Corasick alg.

Procedure AC(y,n,q0)
INPUT: y<-array of m bytes representing the text input
(SQL Query Statement)
n<-integer representing the text length
(SQL Query Length)
q0<-initial state (first character in pattern)
2: State <-q0
3: For i = 1 to n do
4: While g ( State, y[i] = = fail) do
5: State ← f (State)
6: End While
7: State ← g(State,.y[i])
8: If o(State) 􀂏 then
9: Output i
10: Else
11: Output
12: End If
13: End for
14: End Procedure
like image 614
user2967440 Avatar asked Dec 02 '13 13:12

user2967440


2 Answers

You probably won't gain a good understanding of the Aho-Corasick algorithm from reading a little bit of pseudocode. Unless you understand the state transition table, the algorithm will make no sense at all.

There's a decent explanation along with an animation at Aho-Corasick implementation and animation.

The original paper, Efficient String Matching: An Aid to Bibliographic Search(PDF), is well written and understandable, and the pseudocode examples are pretty easy to convert to working code. It'll take a little study, but you should have a good understanding after you read the paper, think about it a bit, and then read it again.

like image 153
Jim Mischel Avatar answered Dec 03 '22 12:12

Jim Mischel


Aho Corasick algorithm is used to solve set matching problem. It means we have a set of strings S, and here comes a long string L to check whether L contains any one in the previous set S.

An basic solution is using a trie tree, i.e. a prefix tree, please see Wikipedia. There are typically two steps to deal with the problem.

  1. Precompute the trie tree based on the string set S;
  2. Scan the string L to check.

The trie tree is easy to understand. It store prefix substrings with nodes starts from the root.

The Aho-Corasick algorithm is an extension of a trie tree, not far from the basic idea. Aho-Corasick algorithm adds a failed pointer to each node on a trie tree.

When fails, a trie tree will restart from the root (add the start index on L by 1), but Aho-Corasick algorithm will restart from the node D pointed by the failed pointer (add the start index on L by the depth of the node D).

Below is a C++ implementation of the Aho-Corasick algorithm. It contains some bugs. http://www.komodia.com/aho-corasick

I fixed the bugs I found. And you can access my version here: https://github.com/elfinhe/KomodiaAhoCorasick

like image 27
Tao HE Avatar answered Dec 03 '22 13:12

Tao HE