
How to avoid allocation of large Byte[] when computing hash of large strings

I am on a mission to eliminate as many allocations to the Large Object Heap (LOH) as I can in my applications. One of the biggest offenders is our code that computes the MD5 hash of a large string.

public static string MD5Hash(this string s)
{
    using (MD5CryptoServiceProvider csp = new MD5CryptoServiceProvider())
    {
        byte[] bytesToHash = Encoding.UTF8.GetBytes(s);
        byte[] hashBytes = csp.ComputeHash(bytesToHash);
        return Convert.ToBase64String(hashBytes);
    }
}

Set aside, for the sake of the example, that the string itself is probably already on the LOH. Our goal is to prevent further allocations to that heap.

Also, the current implementation assumes UTF8 encoding (a big assumption), but really the goal is to generate a byte[] from a string.

MD5CryptoServiceProvider.ComputeHash has an overload that takes a Stream as input, so we can create a method:

public static string MD5Hash(this Stream stream)
{
    using (MD5CryptoServiceProvider csp = new MD5CryptoServiceProvider())
    {
        return Convert.ToBase64String(csp.ComputeHash(stream));
    }
}

This is promising because we don't need a byte[] for ComputeHash to work. We need a stream object that will read bytes from a string as bytes are requested by ComputeHash.

This rather controversial question provides a method for creating a byte array from a string regardless of encoding. However, we want to avoid the creation of a large byte array.

This question provides a method of creating a stream from a string by reading the string into a MemoryStream, but internally that just allocates a large byte[] as well.

Neither really does the trick.

So how can you avoid the allocation of a large byte[]? Is there a Stream class that will read from another stream (or reader) as bytes are read?

Joe Enzminger asked Oct 13 '25 11:10


2 Answers

If you don't care about the encoding, one thing you can do to prevent any further buffer allocation is to use some unsafe code: get to the raw bytes of the string, wrap an UnmanagedMemoryStream around them, and feed that to the MD5 calculation. Note that this hashes the string's in-memory UTF-16 representation, so the result differs from the hash of its UTF-8 bytes.

So something like this:

public static string MD5Hash(this string s)
{
    using (MD5CryptoServiceProvider csp = new MD5CryptoServiceProvider())
    {
        unsafe
        {
            fixed (char* input = s)
            {
                using (var stream = new UnmanagedMemoryStream((byte*)input, sizeof(char) * s.Length))
                    return Convert.ToBase64String(csp.ComputeHash(stream)); 
            }
        }
    }
}
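
If the hash has to match the UTF-8 version from the question, one safe-code alternative is to encode the string in small chunks and feed each chunk to the hash incrementally. This is a different technique from the unsafe code above, not a variation of it. The sketch below is a minimal illustration under stated assumptions: the class name ChunkedHashing, the method name MD5HashChunked and the 4096-character chunk size are all made up for the example, and MD5.Create() stands in for MD5CryptoServiceProvider.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ChunkedHashing
{
    // Hypothetical helper: hashes the UTF-8 bytes of s without ever
    // materializing them all in one array.
    public static string MD5HashChunked(this string s)
    {
        // Assumption: 4096 chars per chunk. Both buffers then stay far
        // below the ~85,000-byte Large Object Heap threshold.
        const int CharsPerChunk = 4096;

        Encoder encoder = Encoding.UTF8.GetEncoder();
        char[] chars = new char[CharsPerChunk];
        byte[] bytes = new byte[Encoding.UTF8.GetMaxByteCount(CharsPerChunk)];

        using (MD5 md5 = MD5.Create())
        {
            for (int offset = 0; offset < s.Length; offset += CharsPerChunk)
            {
                int count = Math.Min(CharsPerChunk, s.Length - offset);
                s.CopyTo(offset, chars, 0, count);

                // flush only on the last chunk, so that a surrogate pair
                // split across two chunks is carried over by the encoder.
                bool isLast = offset + count >= s.Length;
                int byteCount = encoder.GetBytes(chars, 0, count, bytes, 0, isLast);

                // Feed this chunk to the hash; no output buffer is needed.
                md5.TransformBlock(bytes, 0, byteCount, null, 0);
            }

            // Finish the hash with an empty final block.
            md5.TransformFinalBlock(bytes, 0, 0);
            return Convert.ToBase64String(md5.Hash);
        }
    }
}
```

For well-formed strings this should produce the same Base64 value as hashing Encoding.UTF8.GetBytes(s), at the cost of copying the characters through a small reusable buffer.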
Alex answered Oct 15 '25 02:10


You can implement your own stream backed by a string.

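Note that, according to the documentation, Read and Write are basically the only methods that need a real implementation (and Write can just throw a NotSupportedException, since nothing should write to this stream). The other abstract members (CanRead, CanSeek, CanWrite, Length, Position, Flush, Seek and SetLength) must still be overridden, but their implementations can be trivial. The documentation says: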

When you implement a derived class of Stream, you must provide implementations for the Read and Write methods. The asynchronous methods ReadAsync, WriteAsync, and CopyToAsync use the synchronous methods Read and Write in their implementations.

You probably want to also implement ReadByte:

The default implementations of ReadByte and WriteByte create a new single-element byte array, and then call your implementations of Read and Write

Source: https://msdn.microsoft.com/pt-br/library/system.io.stream%28v=vs.110%29.aspx
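
As a rough illustration of the idea, here is a minimal sketch of such a stream. The class name StringReadStream and the 1024-character chunk size are assumptions made for the example, not anything from the documentation. It is read-only and forward-only, encodes the string a small slice at a time, and overrides ReadByte to avoid the per-call array allocation mentioned above.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical class: a read-only, forward-only stream that encodes a
// string on demand instead of converting it to one large byte[].
public sealed class StringReadStream : Stream
{
    private readonly string _source;
    private readonly Encoder _encoder;
    private readonly char[] _chars = new char[1024]; // assumed chunk size
    private readonly byte[] _pending;                // encoded bytes not yet read
    private int _charPos, _pendingPos, _pendingLen;

    public StringReadStream(string source, Encoding encoding)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (encoding == null) throw new ArgumentNullException("encoding");
        _source = source;
        _encoder = encoding.GetEncoder();
        _pending = new byte[encoding.GetMaxByteCount(_chars.Length)];
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }

    // Encodes the next slice of the string whenever the small buffer is
    // exhausted. Returns false once the whole string has been consumed.
    private bool Fill()
    {
        while (_pendingPos == _pendingLen)
        {
            if (_charPos == _source.Length) return false;
            int n = Math.Min(_chars.Length, _source.Length - _charPos);
            _source.CopyTo(_charPos, _chars, 0, n);
            _charPos += n;
            // Flush the encoder only on the final slice so that a surrogate
            // pair split across two slices is encoded correctly.
            bool flush = _charPos == _source.Length;
            _pendingLen = _encoder.GetBytes(_chars, 0, n, _pending, 0, flush);
            _pendingPos = 0;
        }
        return true;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (count == 0 || !Fill()) return 0;
        int n = Math.Min(count, _pendingLen - _pendingPos);
        Buffer.BlockCopy(_pending, _pendingPos, buffer, offset, n);
        _pendingPos += n;
        return n;
    }

    // Overridden so that single-byte reads do not allocate a temporary array.
    public override int ReadByte()
    {
        return Fill() ? _pending[_pendingPos++] : -1;
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
```

It could then be passed to the Stream-based MD5Hash extension from the question, for example new StringReadStream(s, Encoding.UTF8).MD5Hash(). ComputeHash only calls Read until it returns 0, so leaving Length and Seek unsupported should not be a problem for that use.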

Filipe Borges answered Oct 15 '25 02:10