I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). Additional simplification is that none of the strings is contained in any other.
I am currently using boost regex matching with str1 | str2 | ...
. The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.
As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.
e.g., if the strings are abcx
, bcy
and cz
, and I've read so far abc
, I should be in a combined state that means you're either 3 chars into string 1, 2 chars into string 2 or 1 char into string 1
. Then reading x
next will move me to the string 1 matched
state etc., and any char other than xyz
will move to the initial state, and I will not need to retract back to b
.
Any ideas or references are appreciated.
Check out the Aho–Corasick string matching algorithm!
Take a look at Suffix Tree.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With