Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for multiple word matching in text

I have a large set of words (about 10,000) and I need to find if any of those words appear in a given block of text.

Is there a faster algorithm than doing a simple text search for each of the words in the block of text?

like image 838
Enrico Detoma Avatar asked Jul 08 '09 19:07

Enrico Detoma


People also ask

Which algorithm is best for string matching?

The Karp-Rabin Algorithm.

How many string matching algorithms are there?

Exact String Matching Algorithms: These are further classified into four categories: Algorithms based on character comparison: Naive Algorithm: It slides the pattern over text one by one and check for a match. If a match is found, then slides by 1 again to check for subsequent matches.

What is the algorithm used for pattern searching?

We use certain algorithms called pattern recognition to do the searching process. The complexity of pattern searching is O(m(n-m+1)). The algorithms are also known as String Searching Algorithms.


3 Answers

input the 10,000 words into a hashtable then check each of the words in the block of text if its hash has an entry.

Faster though I don't know, just another method (would depend on how many words you are searching for).

simple perl examp:

my $word_block = "the guy went afk after being popped by a brownrabbit";
my %hash = ();
my @words = split /\s/, $word_block;
while(<DATA>) { chomp; $hash{$_} = 1; }
foreach $word (@words)
{
    print "found word: $word\n" if exists $hash{$word};
}

__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown
like image 139
user105033 Avatar answered Oct 02 '22 14:10

user105033


Try out the Aho-Corasick algorithm: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

like image 13
Cuga Avatar answered Oct 02 '22 15:10

Cuga


Build up a trie of your words, and then use that to find which words are in the text.

like image 5
FryGuy Avatar answered Oct 02 '22 16:10

FryGuy