Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most efficient way to find pattern in byte array

Tags:

c#

.net

.net-4.5

I have this following code:

var file = //Memory stream with a file in it
var bytes = file.ToArray();

I need to search the bytes for the first occurrence (if any) of the specified byte sequence: 0xff, 0xd8. (The purpose of this is to find images embedded in files)

So if for example bytes[6501] contains 0xff and bytes[6502] contains 0xd8, thats a match and I need either the index of the position returned (6501), or a new array, which is a copy of the bytes array, except it doesn't have the keys below 6501 from the old array.

My current solution is to loop:

 for (var index = 0; index < bytes.Length; index++)
 {
     if((new byte[] {0xff, 0xd8}).SequenceEqual(bytes.Skip(index).Take(2))
    ...

But it's pretty slow when it's handling bigger files.

Is there some more efficient way to handle this?

like image 635
Fred Avatar asked Aug 20 '14 08:08

Fred


1 Answers

If this is time-critical code, I found the C# compiler (both Mono's implementation and Microsoft's) to have special logic to optimize simple scan loops.

So from profiling experience, I'd implement a sequence search with a hardcoded first-element search like this:

/// <summary>Looks for the next occurrence of a sequence in a byte array</summary>
/// <param name="array">Array that will be scanned</param>
/// <param name="start">Index in the array at which scanning will begin</param>
/// <param name="sequence">Sequence the array will be scanned for</param>
/// <returns>
///   The index of the next occurrence of the sequence of -1 if not found
/// </returns>
private static int findSequence(byte[] array, int start, byte[] sequence) {
  int end = array.Length - sequence.Length; // past here no match is possible
  byte firstByte = sequence[0]; // cached to tell compiler there's no aliasing

  while(start <= end) {
    // scan for first byte only. compiler-friendly.
    if(array[start] == firstByte) {
      // scan for rest of sequence
      for (int offset = 1;; ++offset) {
        if(offset == sequence.Length) { // full sequence matched?
          return start;
        } else if(array[start + offset] != sequence[offset]) {
          break;
        }
      }
    }
    ++start;
  }

  // end of array reached without match
  return -1;
}

Quite a bit longer than other suggestions and prone to off-by-1 errors, but if you're scanning a huge chunk of data or doing this for frequent device IO, this setup will avoid feeding the garbage collector and optimize very well.

EDIT 2019-10-03: Fixed issues pointed out by Warren Rox. Thanks! Tests: https://ideone.com/mmACYj

like image 138
Cygon Avatar answered Sep 17 '22 19:09

Cygon