Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest way to compare two byte arrays?

I am trying to compare two long bytearrays in VB.NET and have run into a snag. Comparing two 50 megabyte files takes almost two minutes, so I'm clearly doing something wrong. I'm on an x64 machine with tons of memory so there are no issues there. Here is the code that I'm using at the moment and would like to change.

_Bytes and item.Bytes are the two different arrays to compare and are already the same length.

For Each B In item.Bytes
   If B <> _Bytes(I) Then
        Mismatch = True
        Exit For
   End If
   I += 1
Next

I need to be able to compare as fast as possible files that are potentially hundreds of megabytes and even possibly a gigabyte or two. Any suggests or algorithms that would be able to do this faster?

Item.bytes is an object taken from the database/filesystem that is returned to compare, because its byte length matches the item that the user wants to add. By comparing the two arrays I can then determine if the user has added something new to the DB and if not then I can just map them to the other file and not waste hard disk drive space.

[Update]

I converted the arrays to local variables of Byte() and then did the same comparison, same code and it ran in like one second (I have to benchmark it still and compare it to others), but if you do the same thing with local variables and use a generic array it becomes massively slower. I’m not sure why, but it raises a lot more questions for me about the use of arrays.

like image 563
Middletone Avatar asked Mar 09 '09 19:03

Middletone


3 Answers

What is the _Bytes(I) call doing? It's not loading the file each time, is it? Even with buffering, that would be bad news!

There will be plenty of ways to micro-optimise this in terms of looking at longs at a time, potentially using unsafe code etc - but I'd just concentrate on getting reasonable performance first. Clearly there's something very odd going on.

I suggest you extract the comparison code into a separate function which takes two byte arrays. That way you know you won't be doing anything odd. I'd also use a simple For loop rather than For Each in this case - it'll be simpler. Oh, and check whether the lengths are correct first :)

EDIT: Here's the code (untested, but simple enough) that I'd use. It's in C# for the minute - I'll convert it in a sec:

public static bool Equals(byte[] first, byte[] second)
{
    if (first == second)
    {
        return true;
    }
    if (first == null || second == null)
    {
        return false;
    }
    if (first.Length != second.Length)
    {
        return false;
    }
    for (int i=0; i < first.Length; i++)
    {
        if (first[i] != second[i])                
        {
            return false;
        }
    }
    return true;
}

EDIT: And here's the VB:

Public Shared Function ArraysEqual(ByVal first As Byte(), _
                                   ByVal second As Byte()) As Boolean
    If (first Is second) Then
        Return True
    End If

    If (first Is Nothing OrElse second Is Nothing) Then
        Return False
    End If
    If  (first.Length <> second.Length) Then
         Return False
    End If

    For i as Integer = 0 To first.Length - 1
        If (first(i) <> second(i)) Then
            Return False
        End If
    Next i
    Return True
End Function
like image 153
Jon Skeet Avatar answered Sep 29 '22 11:09

Jon Skeet


The fastest way to compare two byte arrays of equal size is to use interop. Run the following code on a console application:

using System;
using System.Runtime.InteropServices;
using System.Security;

namespace CompareByteArray
{
    class Program
    {
        static void Main(string[] args)
        {
            const int SIZE = 100000;
            const int TEST_COUNT = 100;

            byte[] arrayA = new byte[SIZE];
            byte[] arrayB = new byte[SIZE];

            for (int i = 0; i < SIZE; i++)
            {
                arrayA[i] = 0x22;
                arrayB[i] = 0x22;
            }

            {
                DateTime before = DateTime.Now;
                for (int i = 0; i < TEST_COUNT; i++)
                {
                    int result = MemCmp_Safe(arrayA, arrayB, (UIntPtr)SIZE);

                    if (result != 0) throw new Exception();
                }
                DateTime after = DateTime.Now;

                Console.WriteLine("MemCmp_Safe: {0}", after - before);
            }

            {
                DateTime before = DateTime.Now;
                for (int i = 0; i < TEST_COUNT; i++)
                {
                    int result = MemCmp_Unsafe(arrayA, arrayB, (UIntPtr)SIZE);

                    if (result != 0) throw new Exception();
                }
                DateTime after = DateTime.Now;

                Console.WriteLine("MemCmp_Unsafe: {0}", after - before);
            }


            {
                DateTime before = DateTime.Now;
                for (int i = 0; i < TEST_COUNT; i++)
                {
                    int result = MemCmp_Pure(arrayA, arrayB, SIZE);

                    if (result != 0) throw new Exception();
                }
                DateTime after = DateTime.Now;

                Console.WriteLine("MemCmp_Pure: {0}", after - before);
            }
            return;
        }

        [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint="memcmp", ExactSpelling=true)]
        [SuppressUnmanagedCodeSecurity]
        static extern int memcmp_1(byte[] b1, byte[] b2, UIntPtr count);

        [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint = "memcmp", ExactSpelling = true)]
        [SuppressUnmanagedCodeSecurity]
        static extern unsafe int memcmp_2(byte* b1, byte* b2, UIntPtr count);

        public static int MemCmp_Safe(byte[] a, byte[] b, UIntPtr count)
        {
            return memcmp_1(a, b, count);
        }

        public unsafe static int MemCmp_Unsafe(byte[] a, byte[] b, UIntPtr count)
        {
            fixed(byte* p_a = a)
            {
                fixed (byte* p_b = b)
                {
                    return memcmp_2(p_a, p_b, count);
                }
            }
        }

        public static int MemCmp_Pure(byte[] a, byte[] b, int count)
        {
            int result = 0;
            for (int i = 0; i < count && result == 0; i += 1)
            {
                result = a[0] - b[0];
            }

            return result;
        }

    }
}
like image 39
danobrega Avatar answered Sep 29 '22 11:09

danobrega


If you don't need to know the byte, use 64-bit ints that gives you 8 at once. Actually, you can figure out the wrong byte, once you've isolated it to a set of 8.

Use BinaryReader:

saveTime  = binReader.ReadInt32()

Or for arrays of ints:

Dim count As Integer = binReader.Read(testArray, 0, 3)
like image 41
sfossen Avatar answered Sep 29 '22 13:09

sfossen