Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Matching bit strings

I have needed to implement a string searching algorithm that finds a pattern of bits in a text of bits (the match may not be byte/word aligned). For starters, I implemented the Boyer-Moore algorithm, but comparing individual bits was too slow for my purposes. So instead I tried implementing a blocked based version that would compare entire bytes/words as described in this paper, but it has become complex and unmanageable (in part due to me not entirely understanding what I'm doing.)

Does anyone have good implementation of such an algorithm?

My specific use case is with pattern length N >= 32, text window 2N, and bits packed into ints. Also N in this case is a multiple of char size N % 8 == 0. I preprocess once and use many times on changing text, like the Boyer-Moore. First match is all I need. Performance is key.

Edit: After successfully implementing the Blocked Boyer-Moore algorithm, I notice no improvement(my bit by bit version is faster!) It's probably a mistake of my own, because I've been racking my brain on it and optimized it to the point where it makes no sense without many lines of comments, yet it's still slower. Here it is.

like image 782
scientiaesthete Avatar asked Aug 24 '12 05:08

scientiaesthete


2 Answers

Overview

I'm new to the SO community, but I look forward to giving something back.

Interesting problem. I put together an implementation which only does byte-based comparisons (with the aid of pre-computed bit patterns and bit masks) as opposed to performing expensive bit-manipulations on compare. As a result, it should be reasonably fast. It doesn't implement any of the Shift Rules (performance optimizations) discussed for the Boyer-Moore algorithm and thus could be further improved.

Although this implementation does depend on the number of pattern bits % CHAR_BIT == 0--on an 8-bit machine, meets your criteria of N % 8 == 0, the implementation will find non-byte-aligned bit patterns. (It also currently requires 8-bit characters ( CHAR_BIT == 8 ), but in the unlikely event your system is not using 8-bit chars, is easily adaptable by changing the all of the arrays/vectors from uint8_t to char, and adjusting the values they contain to reflect the correct number of bits.)

Given that the search doesn't do any bit-twiddling (besides anding pre-computed byte masks) it should be pretty performant.

Algorithm summary

In a nutshell, the pattern to be searched for is specified, and the implementation shifts it by one bit and records the shifted pattern. It also computes masks for the shifted pattern, as for non-byte-aligned bit patterns, some bits at the beginning and end of the comparison will need to be ignored for proper behavior.

The search is conducted for all the pattern bits in each shift position until a match is found or the end of the data buffer is reached.

//
//  BitStringMatch.cpp
//

#include "stdafx.h"
#include <iostream>
#include <cstdint>
#include <vector>
#include <memory>
#include <cassert>

int _tmain(int argc, _TCHAR* argv[])
{
    //Enter text and pattern data as appropriate for your application.  This implementation assumes pattern bits % CHAR_BIT == 0
    uint8_t text[] = { 0xcc, 0xcc, 0xcc, 0x5f, 0xe0, 0x1f, 0xe0, 0x0c }; //1010 1010, 1010 1010, 1010 1010, 010*1 1111, 1110 0000, 0001 1111, 1110 0000, 000*0 1010 
    uint8_t pattern[] = { 0xff, 0x00, 0xff, 0x00 }; //Set pattern to 1111 1111, 0000 0000, 1111 1111, 0000 0000

    assert( CHAR_BIT == 8 ); //Sanity check
    assert ( sizeof( text ) >= sizeof( pattern ) ); //Sanity check

    std::vector< std::vector< uint8_t > > shiftedPatterns( CHAR_BIT, std::vector< uint8_t >( sizeof( pattern ) + 1, 0 ) );  //+1 to accomodate bit shifting of CHAR_BIT bits.
    std::vector< std::pair< uint8_t, uint8_t > > compareMasks( CHAR_BIT, std::pair< uint8_t, uint8_t >( 0xff, 0x00 ) );

    //Initialize pattern shifting through all bit positions
    for( size_t i = 0; i < sizeof( pattern ); ++i ) //Start by initializing the unshifted pattern
    {
        shiftedPatterns[ 0 ][ i ] = pattern[ i ];
    }

    for( size_t i = 1; i < CHAR_BIT; ++i )  //Initialize the other patterns, shifting the previous vector pattern to the right by 1 bit position
    {
        compareMasks[ i ].first >>= i;  //Set the bits to consider in the first...
        compareMasks[ i ].second = 0xff << ( CHAR_BIT - i ); //and last bytes of the pattern

        bool underflow = false;
        for( size_t j = 0; j < sizeof( pattern ) + 1; ++j )
        {
            bool thisUnderflow = shiftedPatterns[ i - 1 ][ j ] & 0x01 ? true : false; 
            shiftedPatterns[ i ][ j ] = shiftedPatterns[ i - 1][ j ] >> 1;

            if( underflow ) //Previous byte shifted out a 1; shift in a 1
            {
                shiftedPatterns[ i ][ j ] |= 0x80;  //Set MSb to 1
            }

            underflow = thisUnderflow;
        }
    }

    //Search text for pattern
    size_t maxTextPos = sizeof( text ) - sizeof( pattern );
    size_t byte = 0;
    bool match = false;
    for( size_t byte = 0; byte <= maxTextPos && !match; ++byte )
    {
        for( size_t bit = 0; bit < CHAR_BIT && ( byte < maxTextPos || ( byte == maxTextPos && bit < 1 ) ); ++bit )
        {
            //Compare first byte of pattern
            if( ( shiftedPatterns[ bit ][ 0 ] & compareMasks[ bit ].first ) != ( text[ byte ] & compareMasks[ bit ].first ) )
            {
                continue;
            }

            size_t foo = sizeof( pattern );
            //Compare all middle bytes of pattern
            bool matchInProgress = true;
            for( size_t pos = 1; pos < sizeof( pattern ) && matchInProgress; ++pos )
            {
                matchInProgress = shiftedPatterns[ bit ][ pos ] == text[ byte + pos ];
            }
            if( !matchInProgress )
            {
                continue;
            }

            if( bit != 0 )  //If compare failed or we're comparing the unshifted pattern, there's no need to compare final pattern buffer byte
            {
                if( ( shiftedPatterns[ bit ][ sizeof( pattern ) ] & compareMasks[ bit ].second ) != ( text[ byte + sizeof( pattern ) ] & compareMasks[ bit ].second ) )
                {
                    continue;
                };
            }

            //We found a match!
            match = true;   //Abandon search
            std::cout << "Match found!  Pattern begins at byte index " << byte << ", bit position " << CHAR_BIT - bit - 1 << ".\n";
            break;
        }
    }
    //If no match
    if( !match )
    {
        std::cout << "No match found.\n";
    }

    std::cout << "\nPress a key to exit...";
    std::getchar();

    return 0;
}

I hope this is helpful.

like image 199
U007D Avatar answered Oct 09 '22 12:10

U007D


If N is big (bigger than, say, 16 bits) then it would be pretty easy to do a preliminary search against 8 shifted copies of the bit pattern (truncating the pattern to eliminate 'partial bytes'). Then you can refine the results by looking at adjacent bits. The byte search (against the 8 shifted copies) could be done using Boyer-Moore or a similarly efficient algorithm.

In case you are wondering: 8 byte searches is probably faster than one bit search since each byte comparison takes just one instruction, whereas the bit manipulation needed to do the bit search takes far more instructions per bit. However, you should always profile to be sure.

like image 22
nneonneo Avatar answered Oct 09 '22 13:10

nneonneo