Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient String/Pattern Matching in C++ (suffixarray, trie, suffixtree?)

I'm looking for an efficient data structure to do String/Pattern Matching on an really huge set of strings. I've found out about tries, suffix-trees and suffix-arrays. But I couldn't find an ready-to-use implementation in C/C++ so far (and implementing it by myself seems difficult and error-prone to me). But I'm still not sure if Suffix-Arrays are really the thing I'm looking for... I've tried libdivsufsort and esaxx, but couldn't find out how to use them for my needs:

I want to use an predefined set of strings, with wildcards (or even regular expressions) to match an user input. I got a huge list of predefined strings i.e.

"WHAT IS *?" "WHAT IS XYZ?" "HOW MUCH *?" ...

Now I want to find the best matching string (if there's one, that matches at all). I.e. User input: >WHAT IS XYZ? Should find "WHAT IS XYZ?" instead of "WHAT IS *?", but "WHAT IS SOMETHING?" should find "WHAT IS *?" (assuming * is a wildcard for any count of characters).

Building the structure isn't time critical (and the structure don't have to be super space efficient), but the search shouldn't take too long. How can that be done easily? Any Framework/Library or code example is welcome

Thanks

like image 713
Constantin Avatar asked Nov 13 '12 16:11

Constantin


1 Answers

Given your comment that the patterns do not need to be updated at runtime I'm not sure you need a runtime structure at all.

I'd recommend using re2c or ragel to compile the patterns to code that will do the pattern matching.

like image 95
mythagel Avatar answered Sep 23 '22 19:09

mythagel