I have a large set of words (about 10,000) and I need to find if any of those words appear in a given block of text.
Is there a faster algorithm than doing a simple text search for each of the words in the block of text?
The Karp-Rabin Algorithm.
Exact String Matching Algorithms: These are further classified into four categories: Algorithms based on character comparison: Naive Algorithm: It slides the pattern over text one by one and check for a match. If a match is found, then slides by 1 again to check for subsequent matches.
We use certain algorithms called pattern recognition to do the searching process. The complexity of pattern searching is O(m(n-m+1)). The algorithms are also known as String Searching Algorithms.
input the 10,000 words into a hashtable then check each of the words in the block of text if its hash has an entry.
Faster though I don't know, just another method (would depend on how many words you are searching for).
simple perl examp:
my $word_block = "the guy went afk after being popped by a brownrabbit";
my %hash = ();
my @words = split /\s/, $word_block;
while(<DATA>) { chomp; $hash{$_} = 1; }
foreach $word (@words)
{
print "found word: $word\n" if exists $hash{$word};
}
__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown
Try out the Aho-Corasick algorithm: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
Build up a trie of your words, and then use that to find which words are in the text.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With