Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

High speed string matching algorithms

I'm basically benchmarking some high speed string matching algorithms, I came across a few.

  1. Backwards Non-deterministic DAWG (Directed acyclic word graph) Matching algorithm by Gonzalo Navarro and Mathieu Raffinot. See "A Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching"

  2. Horspool's improved version of the Boyer-Moore String searching algorithm. See "Practical fast searching in strings"

  3. Shift-Or algorithm with mismatches

  4. KMP

Are there any other better high speed string matching algorithms i can try ?

Edit : There is another thread in similar lines , which has good references too

like image 596
sashank Avatar asked Jul 26 '12 06:07

sashank


People also ask

What is the fastest string matching algorithm?

The Aho-Corasick string searching algorithm simultaneously finds all occurrences of multiple patterns in one pass through the text. On the other hand, the Boyer-Moore algorithm is understood to be the fastest algorithm for a single pattern.

Which algorithm is used for string matching?

Hashing-string matching algorithms: Rabin Karp Algorithm: It matches the hash value of the pattern with the hash value of current substring of text, and if the hash values match then only it starts matching individual characters.

Which is better KMP or Rabin Karp algorithm?

The most important difference between them is how reliable they are in finding a match. KMP guarantees 100% reliability. You cannot guarantee 100% with Rabin Karp because of a chance of collision during hash table lookup.

What is best matching algorithm?

BM25 [5] is the Best Matching ranking algorithm that ranks the documents according to its relevance to the given search query.


1 Answers

Two-way string matching is, to my knowledge, the best general-purpose algorithm for string matching. It has linear worst-case complexity, uses constant space, and doesn't backtrack too much more than necessary. And the theory behind it is very nice.

If you know that your users aren't jerks, naive string matching optimised for your architecture will win for short "needles" while a Boyer-Moore variant will start really doing the sublinear thing for long "needles." However, naive string matching has a quadratic worst case and Boyer-Moore can be made to examine all characters in the input. The extra tables needed to handle mismatches actually carry a surprisingly severe penalty over two-way string matching.

like image 82
tmyklebu Avatar answered Oct 05 '22 02:10

tmyklebu