Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm to find multiple string matches

Tags:

I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ possibilities. The search terms may be 1 or more words.

Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.

I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. Language is C++ and I can use boost.

An example of search terms could be a list of Fortune 500 company names.

Ideas?

like image 701
Dwight Kelly Avatar asked Jul 15 '10 23:07

Dwight Kelly


People also ask

Which algorithm is used for string matching?

Hashing-string matching algorithms: Rabin Karp Algorithm: It matches the hash value of the pattern with the hash value of current substring of text, and if the hash values match then only it starts matching individual characters.

What is the fastest string matching algorithm?

The Aho-Corasick string searching algorithm simultaneously finds all occurrences of multiple patterns in one pass through the text. On the other hand, the Boyer-Moore algorithm is understood to be the fastest algorithm for a single pattern.

Which algorithm is used to find the string pattern in text O n time?

The Boyer–Moore string-search algorithm has been the standard benchmark for the practical string-search literature.

What is the algorithm used for pattern searching?

We use certain algorithms called pattern recognition to do the searching process. The complexity of pattern searching is O(m(n-m+1)). The algorithms are also known as String Searching Algorithms. These are very useful while we perform the search operation in the database.


2 Answers

Don´t reinvent the wheel

This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.

The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.

In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.

And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.

A note of caution: multiple-string matching have been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm don't hesitate on standing on their broad shoulders. Don't reinvent the wheel.

like image 186
Dr. belisarius Avatar answered Sep 20 '22 22:09

Dr. belisarius


As in the case of single patterns, there are several algorithms for multiple-pattern matching, and you will have to find the one that fits your purpose best. The paper A fast algorithm for multi-pattern searching (archived copy) does a review of most of them, including Aho-Corasick (which is kind of the multi-pattern version of the Knuth-Morris-Pratt algorithm, with linear complexity) and Commentz-Walter (a combination of Boyer-Moore and Aho-Corasick), and introduces a new one, which uses ideas from Boyer-Moore for the task of matching multiple patterns.

An alternative, hash-based algorithm not mentioned in that paper, is the Rabin-Karp algorithm, which has a worst-case complexity bigger than other algorithms, but compensates it by reducing the linear factor via hashing. Which one is better depends ultimately on your use case. You may need to implement several of them and compare them in your application if you want to choose the fastest one.

like image 27
Pedro Gimeno Avatar answered Sep 16 '22 22:09

Pedro Gimeno