Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding all matches from a large set of Strings in a smaller String

I have a large set of Words and Phrases (a lexicon or dictionary) which includes wildcards. I need to find all instances of those Words and Phrases within a much smaller String (~150 characters at the moment).

Initially, I wanted to run the operation in reverse; that is to check to see if each word in my smaller string exists within the Lexicon, which could be implemented as a Hash Table. The problem is that some of these values in my Lexicon are not single words and many are wildcards (e.g. substri*).

I'm thinking of using the Rabin-Karp algorithm but I'm not sure this is the best choice.

What's an efficient algorithm or method to perform this operation?

Sample Data:

The dictionary contains hundreds of words and can potentially expand. These words may end with wildcard characters (asterisks). Here are some random examples:

  • good
  • bad
  • freed*
  • careless*
  • great loss

The text we are analyzing (at this point) are short, informal (grammar-wise) English statements. The prime example of text (again, at this point in time) would be a Twitter Tweet. These are limited to roughly 140 Characters. For example:

Just got the Google nexus without a contract. Hands down its the best phone 
I've ever had and the only thing that could've followed my N900.

While it may be helpful to note that we are performing very simple Sentiment Analysis on this text; our Sentiment Analysis technique is not my concern. I'm simply migrating an existing solution to a "real-time" processing system and need to perform some optimizations.

like image 825
Kurtis Avatar asked Dec 16 '22 12:12

Kurtis


1 Answers

I think this is an excellent use case for the Aho-Corasick string-matching algorithm, which is specifically designed to find all matches of a large set of strings in a single string. It runs in two phases - a first phase in which a matching automaton is created (which can be done in advance and requires only linear time), and a second phase in which the automaton is used to find all matches (which requires only linear time, plus time proportional to the total number of matches). The algorithm can also be adapted to support wildcard searching as well.

Hope this helps!

like image 182
templatetypedef Avatar answered Jan 19 '23 00:01

templatetypedef