I have a large set of Words and Phrases (a lexicon or dictionary) which includes wildcards. I need to find all instances of those Words and Phrases within a much smaller String (~150 characters at the moment).
Initially, I wanted to run the operation in reverse; that is to check to see if each word in my smaller string exists within the Lexicon, which could be implemented as a Hash Table. The problem is that some of these values in my Lexicon are not single words and many are wildcards (e.g. substri*).
I'm thinking of using the Rabin-Karp algorithm but I'm not sure this is the best choice.
What's an efficient algorithm or method to perform this operation?
Sample Data:
The dictionary contains hundreds of words and can potentially expand. These words may end with wildcard characters (asterisks). Here are some random examples:
The text we are analyzing (at this point) are short, informal (grammar-wise) English statements. The prime example of text (again, at this point in time) would be a Twitter Tweet. These are limited to roughly 140 Characters. For example:
Just got the Google nexus without a contract. Hands down its the best phone
I've ever had and the only thing that could've followed my N900.
While it may be helpful to note that we are performing very simple Sentiment Analysis on this text; our Sentiment Analysis technique is not my concern. I'm simply migrating an existing solution to a "real-time" processing system and need to perform some optimizations.
I think this is an excellent use case for the Aho-Corasick string-matching algorithm, which is specifically designed to find all matches of a large set of strings in a single string. It runs in two phases - a first phase in which a matching automaton is created (which can be done in advance and requires only linear time), and a second phase in which the automaton is used to find all matches (which requires only linear time, plus time proportional to the total number of matches). The algorithm can also be adapted to support wildcard searching as well.
Hope this helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With