Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

efficient algorithm for searching one of several strings in a text?

I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). Additional simplification is that none of the strings is contained in any other.

I am currently using boost regex matching with str1 | str2 | .... The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.

As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.

e.g., if the strings are abcx, bcy and cz, and I've read so far abc, I should be in a combined state that means you're either 3 chars into string 1, 2 chars into string 2 or 1 char into string 1. Then reading x next will move me to the string 1 matched state etc., and any char other than xyz will move to the initial state, and I will not need to retract back to b.

Any ideas or references are appreciated.

like image 992
davka Avatar asked Dec 16 '22 22:12

davka


2 Answers

Check out the Aho–Corasick string matching algorithm!

like image 121
maxschlepzig Avatar answered Mar 09 '23 01:03

maxschlepzig


Take a look at Suffix Tree.

like image 44
wilhelmtell Avatar answered Mar 09 '23 01:03

wilhelmtell