Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Boyer-Moore Practical in C#?

Boyer-Moore is probably the fastest non-indexed text-search algorithm known. So I'm implementing it in C# for my Black Belt Coder website.

I had it working and it showed roughly the expected performance improvements compared to String.IndexOf(). However, when I added the StringComparison.Ordinal argument to IndexOf, it started outperforming my Boyer-Moore implementation. Sometimes, by a considerable amount.

I wonder if anyone can help me figure out why. I understand why StringComparision.Ordinal might speed things up, but how could it be faster than Boyer-Moore? Is it because of the the overhead of the .NET platform itself, perhaps because array indexes must be validated to ensure they're in range, or something else altogether. Are some algorithms just not practical in C#.NET?

Below is the key code.

// Base for search classes abstract class SearchBase {     public const int InvalidIndex = -1;     protected string _pattern;     public SearchBase(string pattern) { _pattern = pattern; }     public abstract int Search(string text, int startIndex);     public int Search(string text) { return Search(text, 0); } }  /// <summary> /// A simplified Boyer-Moore implementation. ///  /// Note: Uses a single skip array, which uses more memory than needed and /// may not be large enough. Will be replaced with multi-stage table. /// </summary> class BoyerMoore2 : SearchBase {     private byte[] _skipArray;      public BoyerMoore2(string pattern)         : base(pattern)     {         // TODO: To be replaced with multi-stage table         _skipArray = new byte[0x10000];          for (int i = 0; i < _skipArray.Length; i++)             _skipArray[i] = (byte)_pattern.Length;         for (int i = 0; i < _pattern.Length - 1; i++)             _skipArray[_pattern[i]] = (byte)(_pattern.Length - i - 1);     }      public override int Search(string text, int startIndex)     {         int i = startIndex;          // Loop while there's still room for search term         while (i <= (text.Length - _pattern.Length))         {             // Look if we have a match at this position             int j = _pattern.Length - 1;             while (j >= 0 && _pattern[j] == text[i + j])                 j--;              if (j < 0)             {                 // Match found                 return i;             }              // Advance to next comparision             i += Math.Max(_skipArray[text[i + j]] - _pattern.Length + 1 + j, 1);         }         // No match found         return InvalidIndex;     } } 

EDIT: I've posted all my test code and conclusions on the matter at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.

like image 306
Jonathan Wood Avatar asked Feb 05 '11 02:02

Jonathan Wood


People also ask

What is Boyer-Moore algorithm work?

Boyer-Moore-Horspool is an algorithm for finding substrings into strings. This algorithm compares each characters of substring to find a word or the same characters into the string. When characters do not match, the search jumps to the next matching position in the pattern by the value indicated in the Bad Match Table.

Where is Boyer-Moore algorithm used?

In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature.

Is Boyer Moore better than KMP?

But, Boyer Moore can be much faster than KMP, but only on certain inputs that allow many characters to be skipped (similar to the example in the 2nd point). So it can be faster or slower than KMP depending on the given input, while KMP is perfectly reliable at O(m+n).

Which is the best string matching algorithm?

Boyer Moore Algorithm: This algorithm uses best heurestics of Naive and KMP algorithm and starts matching from the last character of the pattern. Using the Trie data structure: It is used as an efficient information retrieval data structure.


1 Answers

Based on my own tests and the comments made here, I've concluded that the reason String.IndexOf() performs so well with StringComparision.Ordinal is because the method calls into unmanaged code that likely employs hand-optimized assembly language.

I have run a number of different tests and String.IndexOf() just seems to be faster than anything I can implement using managed C# code.

If anyone's interested, I've written everything I've discovered about this and posted several variations of the Boyer-Moore algorithm in C# at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.

like image 135
Jonathan Wood Avatar answered Sep 30 '22 00:09

Jonathan Wood