I have the following code:
var file = //Memory stream with a file in it
var bytes = file.ToArray();
I need to search the bytes for the first occurrence (if any) of the byte sequence 0xff, 0xd8. (The purpose of this is to find images embedded in files.)
So if, for example, bytes[6501] contains 0xff and bytes[6502] contains 0xd8, that's a match, and I need either the index of that position (6501) returned, or a new array that is a copy of the bytes array but without the elements before index 6501.
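For what it's worth, once the index is known, the trimmed copy could be produced with something like:

var trimmed = bytes.Skip(6501).ToArray();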
My current solution is to loop:
for (var index = 0; index < bytes.Length; index++)
{
    if ((new byte[] { 0xff, 0xd8 }).SequenceEqual(bytes.Skip(index).Take(2)))
        ...
But it's pretty slow when it's handling bigger files.
Is there some more efficient way to handle this?
If this is time-critical code: I've found that the C# compilers (both Mono's and Microsoft's) have special logic to optimize simple scan loops. So, based on profiling experience, I'd implement the sequence search with a hard-coded scan for the first element, like this:
/// <summary>Looks for the next occurrence of a sequence in a byte array</summary>
/// <param name="array">Array that will be scanned</param>
/// <param name="start">Index in the array at which scanning will begin</param>
/// <param name="sequence">Sequence the array will be scanned for</param>
/// <returns>
/// The index of the next occurrence of the sequence, or -1 if not found
/// </returns>
private static int findSequence(byte[] array, int start, byte[] sequence) {
    int end = array.Length - sequence.Length; // past here no match is possible
    byte firstByte = sequence[0]; // cached to tell compiler there's no aliasing

    while (start <= end) {
        // scan for first byte only. compiler-friendly.
        if (array[start] == firstByte) {
            // scan for rest of sequence
            for (int offset = 1;; ++offset) {
                if (offset == sequence.Length) { // full sequence matched?
                    return start;
                } else if (array[start + offset] != sequence[offset]) {
                    break;
                }
            }
        }
        ++start;
    }

    // end of array reached without match
    return -1;
}
It's quite a bit longer than the other suggestions and prone to off-by-one errors, but if you're scanning a huge chunk of data or doing this for frequent device I/O, this setup avoids feeding the garbage collector and optimizes very well.
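For example, here's a minimal usage sketch assuming the bytes array from the question (the marker, jpegStart, and trimmed names are just for illustration):

var marker = new byte[] { 0xff, 0xd8 };
int jpegStart = findSequence(bytes, 0, marker);
if (jpegStart >= 0) {
    // copy everything from the match onwards into a new array
    var trimmed = new byte[bytes.Length - jpegStart];
    Array.Copy(bytes, jpegStart, trimmed, 0, trimmed.Length);
}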
EDIT 2019-10-03: Fixed issues pointed out by Warren Rox. Thanks! Tests: https://ideone.com/mmACYj