Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find byte sequence within a byte array

Tags:

c#

indexof

I have a byte array and wish to find the "occurrences" of some bytes.

For example 00 69 73 6F 6D in a very large byte array (> 50/100 Megabytes)

OR

even better a reverse operation: Searching the most common pattern without knowing it the code should be able to read and find it from the file.

like image 523
Ben Avatar asked May 28 '16 15:05

Ben


People also ask

How do you find byte array?

We can also get the byte array using the below code. byte[] byteArr = str. getBytes("UTF-8");

What is a byte sequence?

A byte sequence is a mutable data structure that contains a fixed-length sequence of bytes. Each byte can be indexed in constant time for reading or writing. Given a byte sequence s of length l , we can access each of the l bytes of s via its index in the sequence.

What is stored in a byte array?

A byte is 8 bits (binary data). A byte array is an array of bytes (tautology FTW!). You could use a byte array to store a collection of binary data, for example, the contents of a file. The downside to this is that the entire file contents must be loaded into memory.

How do you get a byte from a string?

Convert byte[] to String (text data) toString() to get the string from the bytes; The bytes. toString() only returns the address of the object in memory, NOT converting byte[] to a string ! The correct way to convert byte[] to string is new String(bytes, StandardCharsets. UTF_8) .


1 Answers

You can use the Boyer-Moore algorithm to efficiently search for a sequence of bytes in an array of bytes.

Here's a C# version I converted from the Java version from the Wikipedia entry on Boyer-Moore.

public sealed class BoyerMoore
{
    readonly byte[] needle;
    readonly int[] charTable;
    readonly int[] offsetTable;

    public BoyerMoore(byte[] needle)
    {
        this.needle = needle;
        this.charTable = makeByteTable(needle);
        this.offsetTable = makeOffsetTable(needle);
    }

    public IEnumerable<int> Search(byte[] haystack)
    {
        if (needle.Length == 0)
            yield break;

        for (int i = needle.Length - 1; i < haystack.Length;)
        {
            int j;

            for (j = needle.Length - 1; needle[j] == haystack[i]; --i, --j)
            {
                if (j != 0)
                    continue;

                yield return i;
                i += needle.Length - 1;
                break;
            }

            i += Math.Max(offsetTable[needle.Length - 1 - j], charTable[haystack[i]]);
        }
    }

    static int[] makeByteTable(byte[] needle)
    {
        const int ALPHABET_SIZE = 256;
        int[] table = new int[ALPHABET_SIZE];

        for (int i = 0; i < table.Length; ++i)
            table[i] = needle.Length;

        for (int i = 0; i < needle.Length - 1; ++i)
            table[needle[i]] = needle.Length - 1 - i;

        return table;
    }

    static int[] makeOffsetTable(byte[] needle)
    {
        int[] table = new int[needle.Length];
        int lastPrefixPosition = needle.Length;

        for (int i = needle.Length - 1; i >= 0; --i)
        {
            if (isPrefix(needle, i + 1))
                lastPrefixPosition = i + 1;

            table[needle.Length - 1 - i] = lastPrefixPosition - i + needle.Length - 1;
        }

        for (int i = 0; i < needle.Length - 1; ++i)
        {
            int slen = suffixLength(needle, i);
            table[slen] = needle.Length - 1 - i + slen;
        }

        return table;
    }

    static bool isPrefix(byte[] needle, int p)
    {
        for (int i = p, j = 0; i < needle.Length; ++i, ++j)
            if (needle[i] != needle[j])
                return false;

        return true;
    }

    static int suffixLength(byte[] needle, int p)
    {
        int len = 0;

        for (int i = p, j = needle.Length - 1; i >= 0 && needle[i] == needle[j]; --i, --j)
            ++len;

        return len;
    }
}

Here's some console app test code for it:

public static void Main()
{
    byte[] haystack = new byte[10000];
    byte[] needle = { 0x00, 0x69, 0x73, 0x6F, 0x6D };

    // Put a few copies of the needle into the haystack.

    for (int i = 1000; i <= 9000; i += 1000) 
        Array.Copy(needle, 0, haystack, i, needle.Length);

    var searcher = new BoyerMoore(needle);

    foreach (int index in searcher.Search(haystack))
        Console.WriteLine(index);
}

Note how the Search() method returns the indices of all the locations of the start of needle inside haystack.

If you just wanted the count, you could just do:

int count = new BoyerMoore(needle).Search(haystack).Count();

For your second question: I assume you are asking about finding the longest repeated sequence of bytes?

That's a much more complicated - and very different - question. If you want an answer for that, you should ask a separate question for it, but you should read the Wikipedia entry on the "longest repeated substring problem".

like image 117
Matthew Watson Avatar answered Sep 29 '22 13:09

Matthew Watson