 

Scalability of Aho-Corasick

I want to search a text document for occurrences of key phrases from a database of key phrases (extracted from Wikipedia article titles). (I.e., given a document, I want to find out whether any of the phrases have a corresponding Wikipedia article.) I found out about the Aho-Corasick algorithm, and I want to know whether building an Aho-Corasick automaton for a dictionary of millions of entries is efficient and scalable.

asked Feb 27 '11 by z33m


3 Answers

Let's just make a simple calculation:

Assume that you have 1 million patterns (strings, phrases) with an average length of 10 characters, and a value (label, token, pointer, etc.) of 1 word (4 bytes) assigned to each pattern.

Then you will need an array of 10 + 4 = 14 million bytes (14 MB) just to hold the list of patterns.

From 1 million patterns of 10 bytes (letters, chars) each, you could build an AC trie with no more than 10 million nodes. How big this trie is in practice depends on the size of each node. A node should keep at least 1 byte for a label (letter) and 1 word (4 bytes) for a pointer to the next node in the trie (or to a pattern, for a terminal node), plus 1 bit (boolean) to mark a terminal node: about 5 bytes in total.

So, at a minimum, a trie for 1 million patterns of 10 chars each will need about 50 million bytes, or about 50 MB, of memory.

In practice it might be 3-10 times more, but even that is very manageable, as 500 MB of memory is quite moderate today. (Compare it with Windows applications like Word or Outlook.)
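The arithmetic is easy to reproduce; here is the same back-of-envelope estimate as a throwaway Python sketch (all numbers are the assumptions above, not measurements):

```python
patterns = 1_000_000   # number of patterns
avg_len = 10           # average pattern length in bytes
value_size = 4         # one machine word (4 bytes) per pattern

pattern_list = patterns * (avg_len + value_size)
print(f"pattern list: {pattern_list / 1e6:.0f} MB")       # ~14 MB

max_nodes = patterns * avg_len   # at most one trie node per character
node_size = 5                    # 1-byte label + 4-byte pointer (+1 bit flag)
trie_min = max_nodes * node_size
print(f"trie, lower bound: {trie_min / 1e6:.0f} MB")      # ~50 MB
print(f"3-10x in practice: {3 * trie_min / 1e6:.0f}-{10 * trie_min / 1e6:.0f} MB")
```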

In terms of speed the Aho-Corasick (AC) algorithm is almost unbeatable; it remains the best algorithm for multiple-pattern matching ever devised. That's my strong personal educated opinion, academic garbage aside.

All reports of "new" latest-and-greatest algorithms that might outperform AC are highly exaggerated (except maybe for some special cases with short patterns, like DNA).

In practice, the only improvement to AC can come from more and faster hardware (multiple cores, faster CPUs, clusters, etc.).

Don't take my word for it: test it for yourself. But remember that the real speed of AC strongly depends on the implementation (language and quality of coding).
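If you want to test it yourself, the third-party pyahocorasick package for Python is one ready-made implementation (a quick sketch; the sample patterns and text are made up):

```python
# pip install pyahocorasick -- a C implementation with Python bindings.
import ahocorasick

automaton = ahocorasick.Automaton()
for idx, phrase in enumerate(["aho corasick", "trie", "wikipedia"]):
    automaton.add_word(phrase, (idx, phrase))  # value to return on a match
automaton.make_automaton()                     # build the failure links

text = "the aho corasick trie scales to millions of patterns"
for end_pos, (idx, phrase) in automaton.iter(text):
    print(end_pos, phrase)  # end index of each match, plus its value
```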

answered Oct 01 '22 by nirvichara


In theory, it should maintain linear speed, subject only to the effects of the memory hierarchy: it will slow down as it gets too big to fit in cache, and when it gets really big, you'll have problems if it starts getting paged out.

OTOH, the big win with Aho-Corasick is when searching for decent-sized substrings that may occur at any possible location within the string being fed in. If your text document is already cut up into words, and your search phrases are no longer than, say, 6 words, then you could build a hash table of K-word phrases, and then look up every contiguous K-word section of the input text in it, for K = 1..6.
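A minimal sketch of that hash-table idea, assuming whitespace tokenization and case-insensitive matching (the function name and sample data are invented for illustration):

```python
def find_phrases(text, phrases, max_k=6):
    """Return every phrase (as a word tuple) that occurs in text.

    `phrases` is an iterable of strings, each at most `max_k` words long.
    """
    table = {tuple(p.lower().split()) for p in phrases}
    words = text.lower().split()
    hits = set()
    for k in range(1, max_k + 1):               # window sizes 1..max_k
        for i in range(len(words) - k + 1):     # slide over the document
            window = tuple(words[i:i + k])
            if window in table:
                hits.add(window)
    return hits

doc = "The Aho Corasick algorithm builds a trie of patterns"
titles = ["Aho Corasick algorithm", "trie", "hash table"]
print(find_phrases(doc, titles))
# {('aho', 'corasick', 'algorithm'), ('trie',)}
```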

(Answer to comment)

Aho-Corasick needs to live in memory, because you will be following pointers all over the place. If you have to work outside memory, it's probably easiest to go back to old-fashioned sort/merge. Create a file of K-word records from the input data, where K is the maximum number of words in any phrase you are interested in. Sort it, and then merge it against a file of sorted Wikipedia phrases. You can probably do this almost by hand on Unix/Linux, using utilities such as sort and join, and a bit of shell/awk/perl/whatever. See also http://en.wikipedia.org/wiki/Key_Word_in_Context (I'm old enough to have actually used one of these indexes, provided as bound pages of computer printout).
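As a rough Python stand-in for that sort/merge pipeline (the real thing would run sort on both files and then join them; the inlined data here is hypothetical):

```python
def merge_join(kgrams_sorted, phrases_sorted):
    """Yield items present in both sorted sequences (a classic merge join)."""
    it_a, it_b = iter(kgrams_sorted), iter(phrases_sorted)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a < b:
            a = next(it_a, None)
        elif a > b:
            b = next(it_b, None)
        else:                      # present in both files: a match
            yield a
            a, b = next(it_a, None), next(it_b, None)

# K-word records cut from the document, then sorted:
kgrams = sorted(["aho corasick algorithm", "corasick algorithm builds", "trie"])
# The sorted dictionary of Wikipedia phrases:
phrases = sorted(["aho corasick algorithm", "hash table", "trie"])
print(list(merge_join(kgrams, phrases)))
# ['aho corasick algorithm', 'trie']
```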

answered Oct 02 '22 by mcdowella


Well, there is a workaround: write the built AC trie of the dictionary into a text file in an XML-like format, make an index file for the first 6 levels of that trie, and so on. In my tests I searched for all partial matches of a sentence in the dictionary (500,000 entries), and got ~150 ms for ~100 results from a sentence of 150-200 symbols.

For more details, check out this paper: http://212.34.233.26/aram/IJITA17v2A.Avetisyan.doc
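As a loose illustration of the general shape of that idea (an on-disk dictionary with an in-memory index over the first 6 characters of each entry; this simplified sketch is my guess at the approach, not the paper's actual XML-like format):

```python
import json

INDEX_DEPTH = 6  # corresponds to indexing the first 6 levels of the trie

def build_files(patterns, path="trie.jsonl"):
    """Group patterns by their first INDEX_DEPTH chars, one JSON line per group."""
    groups = {}
    for p in sorted(patterns):
        groups.setdefault(p[:INDEX_DEPTH], []).append(p)
    index = {}
    with open(path, "w") as f:
        for prefix, group in groups.items():
            index[prefix] = f.tell()          # byte offset of this group
            f.write(json.dumps(group) + "\n")
    return index                              # small enough to keep in memory

def lookup(word, index, path="trie.jsonl"):
    """Seek straight to the group for the word's prefix; read one line only."""
    offset = index.get(word[:INDEX_DEPTH])
    if offset is None:
        return False
    with open(path) as f:
        f.seek(offset)
        return word in json.loads(f.readline())

idx = build_files(["wikipedia", "wiki", "trie", "aho corasick"])
print(lookup("wiki", idx), lookup("wikis", idx))  # True False
```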

answered Oct 01 '22 by Aram Avetisyan