Speed of searching for byte[] in byte[] vs. string in string - why is the latter faster?

I have a task where I need to search for sequences in a file. In a test application, I read the file as a string (File.ReadAllText) and used string.IndexOf to search for a sequence. When I implemented the same algorithm with bytes (reading the file as a byte array and searching for a byte array within a byte array), I noticed that searching for byte[] in byte[] is about 3 times as slow as searching for string in string. I checked this thoroughly: with exactly the same code, the byte[] version takes about 3 times as long to execute as the string version - roughly 16s for bytes vs. ~5s for string.

For searching byte arrays, I used the approaches described in byte[] array pattern search. For searching strings, I used the built-in IndexOf method of the string class. Here is one of the byte[] IndexOf implementations I tried:

    public int IndexOf(byte[] source, byte[] pattern, int startpos = 0)
    {
        // Last position at which the pattern can still fit inside the source.
        int search_limit = source.Length - pattern.Length;
        for (int i = startpos; i <= search_limit; i++)
        {
            // Only compare the full pattern when the first byte matches.
            if (source[i] == pattern[0])
            {
                bool found = true;
                for (int j = 1; j < pattern.Length; j++)
                {
                    if (source[i + j] != pattern[j])
                    {
                        found = false;
                        break;
                    }
                }
                if (found)
                    return i;
            }
        }
        return -1;
    }
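
For reference, the timing comparison was done along these lines (a rough sketch, not the exact harness; the file path and pattern are placeholders, and IndexOf is the method above):

    // Rough sketch of the timing comparison; placeholder file and pattern, not the real test data.
    // Assumes: using System; using System.Diagnostics; using System.IO; using System.Text;
    string path = "test.bin";       // placeholder test file
    string textPattern = "NEEDLE";  // placeholder pattern
    byte[] bytePattern = Encoding.ASCII.GetBytes(textPattern);

    string text = File.ReadAllText(path);
    byte[] data = File.ReadAllBytes(path);

    var sw = Stopwatch.StartNew();
    int pos = 0, hits = 0;
    while ((pos = text.IndexOf(textPattern, pos)) >= 0) { hits++; pos++; }
    Console.WriteLine("string.IndexOf: {0} matches in {1} ms", hits, sw.ElapsedMilliseconds);

    sw.Restart();
    pos = 0; hits = 0;
    while ((pos = IndexOf(data, bytePattern, pos)) >= 0) { hits++; pos++; }  // IndexOf shown above
    Console.WriteLine("byte[] IndexOf: {0} matches in {1} ms", hits, sw.ElapsedMilliseconds);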

Basically, looking up the next match for a sequence of bytes in a byte array takes three times as long as looking up the next match for a sequence of chars in a string. My question is - WHY?

Does anyone know how .NET handles searching for chars in a string, what kind of optimisation it does, and what algorithm it uses? Is there a faster algorithm than the one I'm using here? Or does someone have an idea of what I'm doing wrong, so that it takes more time than it should? I really do not understand how searching for a string in a string can be 3 times as fast as searching for a byte[] in a byte[]...

UPDATE: I have tried the unsafe algorithm as suggested. It was as follows:

    public static unsafe long IndexOfFast(byte[] Haystack, byte[] Needle, long startpos = 0)
    {
        long i = startpos;
        fixed (byte* H = Haystack) fixed (byte* N = Needle)
        {
            // Stop where the needle can no longer fit inside the haystack.
            for (byte* hNext = H + startpos, hEnd = H + Haystack.LongLength - Needle.LongLength + 1; hNext < hEnd; i++, hNext++)
            {
                // Found records each byte comparison; the loop exits as soon as a mismatch is recorded.
                bool Found = true;
                for (byte* hInc = hNext, nInc = N, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
                if (Found) return i;
            }
            return -1;
        }
    }

The strange thing is, it actually proved to be twice as slow! I changed it to add my personal tweak (checking the first byte of the needle before iterating through the rest), and it now looks like this:

    public static unsafe long IndexOfFast(byte[] Haystack, byte[] Needle, long startpos = 0)
    {
        long i = startpos;
        fixed (byte* H = Haystack) fixed (byte* N = Needle)
        {
            // Stop where the needle can no longer fit inside the haystack.
            for (byte* hNext = H + startpos, hEnd = H + Haystack.LongLength - Needle.LongLength + 1; hNext < hEnd; i++, hNext++)
            {
                // Only walk the rest of the needle when the first byte matches.
                if (*hNext == *N)
                {
                    bool Found = true;
                    for (byte* hInc = hNext + 1, nInc = N + 1, nEnd = N + Needle.LongLength; Found && nInc < nEnd; Found = *nInc == *hInc, nInc++, hInc++) ;
                    if (Found) return i;
                }
            }
            return -1;
        }
    }

Now it takes exactly the same time to execute as the safe version. My question again is - any ideas why? Shouldn't it be faster than the safe version, since it's unsafe and works directly with pointers?

asked May 31 '13 by Istrebitel
1 Answer

Basically, looking up the next match for a sequence of bytes in a byte array takes three times as long as looking up the next match for a sequence of chars in a string. My question is - WHY?

Because the string search algorithm has been heavily optimized; it is written in tight unmanaged code which does not spend time checking array bounds. If you were to optimize your byte search algorithm in the same way, it would be just as fast; the string search algorithm uses the same naive algorithm that you are using.
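
As a side note, and only as a sketch: on a modern .NET runtime with Span<T> (which did not exist when this question was asked), the span-based IndexOf puts a byte[] search on a similarly optimized code path. The helper name below is just for illustration.

    // Sketch only: assumes a modern .NET runtime (Span<T>, .NET Core 2.1 or later).
    // MemoryExtensions.IndexOf searches the span using the runtime's own optimized code.
    // Requires: using System;
    static int IndexOfSpan(byte[] haystack, byte[] needle, int startpos = 0)
    {
        // IndexOf returns an offset relative to the slice, so add startpos back on.
        int found = haystack.AsSpan(startpos).IndexOf(needle);
        return found < 0 ? -1 : found + startpos;
    }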

Your algorithm is fine -- this is the standard "naive" search, and Kevin's claims notwithstanding, the naive algorithm is in practice almost always the fastest on typical data. Blazing through an array looking for a byte is incredibly fast on modern hardware. It depends on the size of your problem; if you're looking for a long DNA string in the middle of the human genome then Boyer-Moore is totally worth the expense of preprocessing. If you're looking for 0xDEADBEEF in a twenty KB file then you're not going to beat the naive algorithm if it is efficiently implemented.
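
As an illustration of that preprocessing cost, here is a minimal Boyer-Moore-Horspool sketch (a simplified Boyer-Moore variant; the method name and shape are just for this example):

    // Boyer-Moore-Horspool sketch, illustrative only.
    public static int IndexOfHorspool(byte[] haystack, byte[] needle, int startpos = 0)
    {
        int n = haystack.Length, m = needle.Length;
        if (m == 0) return startpos;
        if (m > n - startpos) return -1;

        // Preprocessing: for each byte value, how far the window may slide when the
        // byte currently aligned with the end of the needle does not yield a full match.
        var shift = new int[256];
        for (int i = 0; i < 256; i++) shift[i] = m;
        for (int i = 0; i < m - 1; i++) shift[needle[i]] = m - 1 - i;

        int pos = startpos;
        while (pos <= n - m)
        {
            // Compare the needle right to left against the current window.
            int j = m - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) j--;
            if (j < 0) return pos;
            pos += shift[haystack[pos + m - 1]];
        }
        return -1;
    }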

For maximum speed what you should do here is turn off the safety system and write the code using unsafe pointer arithmetic.

answered by Eric Lippert