Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Algorithm for linear pattern matching?

I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101, 01010100100, OR 10100100010 within a list of length 8 million. I only need to find the first occurrence of either, and then return the index at which it occurs. However, doing the looping and accesses over the large list can be expensive, and I'd rather not do it too many times.

Is there a faster method than doing

foreach (patterns) {
    for (i=0; i < listLength; i++)
        for(t=0; t < patternlength; t++)
            if( list[i+t] != pattern[t] ) {
                 break;
            }
            if( t == patternlength - 1 ) {
                 return i;  // pattern found!
            }
        }
    }
}

Edit: BTW, I have implemented this program according to the above pseudocode, and performance is OK, but nothing spectacular. I'm estimating that I process about 6 million bits a second on a single core of my processor. I'm using this for image processing, and it's going to have to go through a few thousand 8 megapixel images, so every little bit helps.

Edit: If it's not clear, I'm working with a bit array, so there's only two possibilities: ONE and ZERO. And it's in C++.

Edit: Thanks for the pointers to BM and KMP algorithms. I noted that, on the Wikipedia page for BM, it says

The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly).

That looks interesting, but it didn't give any examples of such algorithms. Would something like that also help?

like image 295
erjiang Avatar asked Dec 02 '22 07:12

erjiang


2 Answers

The key for Googling is "multi-pattern" string matching.

Back in 1975, Aho and Corasick published a (linear-time) algorithm, which was used in the original version of fgrep. The algorithm subsequently got refined by many researchers. For example, Commentz-Walter (1979) combined Aho&Corasick with Boyer&Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.

An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.

I suggest to start with reading the Aho-Corasick and Rabin-Karp Wikipedia pages and then decide whether that would make sense in your case. If so, maybe there already is an implementation for your language/runtime available.

like image 98
earl Avatar answered Dec 23 '22 07:12

earl


Yes.

The Boyer–Moore string search algorithm

See also: Knuth–Morris–Pratt algorithm

like image 23
Mitch Wheat Avatar answered Dec 23 '22 08:12

Mitch Wheat