Serializing a very large List of items into Azure blob storage using C#

I have a large list of objects that I need to store and retrieve later. The list will always be used as a unit and list items are not retrieved individually. The list contains about 7000 items totaling about 1GB, but could easily escalate to ten times that or more.

We have been using BinaryFormatter.Serialize() (System.Runtime.Serialization.Formatters.Binary.BinaryFormatter) to do the serialization, and the serialized output was uploaded as a blob to Azure blob storage. We found this generally fast and efficient, but it became inadequate as we tested it with larger data sets, throwing an OutOfMemoryException. From what I understand, although I'm writing to a stream, my problem is that the BinaryFormatter.Serialize() method must first serialize everything to memory before the blob can be uploaded, causing my exception.

The binary serializer looks as follows:

public void Upload(object value, string blobName, bool replaceExisting)
{
    CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);
    var formatter = new BinaryFormatter()
    {
        AssemblyFormat = FormatterAssemblyStyle.Simple,
        FilterLevel = TypeFilterLevel.Low,
        TypeFormat = FormatterTypeStyle.TypesAlways
    };

    using (var stream = blockBlob.OpenWrite())
    {
        formatter.Serialize(stream, value);
    }
}

The OutOfMemoryException occurs on the formatter.Serialize(stream, value) line.

I therefore tried using a different protocol, Protocol Buffers. I tried both implementations from the NuGet packages protobuf-net and Google.Protobuf, but the serialization was horribly slow (roughly 30 minutes) and, from what I have read, Protobuf is not optimized for serializing data larger than 1MB. So I went back to the drawing board and came across Cap'n Proto, which promises to solve my speed issues by using memory mapping. I am trying to use @marc-gravell's C# bindings, but I am having some difficulty implementing a serializer, as the project does not have thorough documentation yet. Moreover, I'm not 100% sure that Cap'n Proto is the correct choice of protocol, but I am struggling to find any alternative suggestions online.

How can I serialize a very large collection of items to blob storage, without hitting memory issues, and in a reasonably fast way?

asked Nov 09 '22 by 08Dc91wk

1 Answer

Perhaps you should switch to JSON?

Using a JSON serializer such as Json.NET, you can stream to and from files (or blobs) and serialize/deserialize piecemeal, as the stream is read.

Would your objects map to JSON well?
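
For instance, here is a minimal sketch of the write side, streaming a large list straight to blob storage as one JSON array. This is only an illustration, assuming Json.NET (Newtonsoft.Json) and the same BlobContainer/CloudBlockBlob types used in your question:

    // Sketch only: assumes Newtonsoft.Json and the question's BlobContainer.
    public void UploadAsJson<T>(IEnumerable<T> items, string blobName)
    {
        CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);

        using (var blobStream = blockBlob.OpenWrite())
        using (var textWriter = new StreamWriter(blobStream))
        using (var jsonWriter = new JsonTextWriter(textWriter))
        {
            var serializer = new JsonSerializer();

            // Write one JSON array, item by item, so only the current item
            // needs to be serialized in memory at any point.
            jsonWriter.WriteStartArray();
            foreach (var item in items)
            {
                serializer.Serialize(jsonWriter, item);
            }
            jsonWriter.WriteEndArray();
        }
    }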

This is what I use to take a NetworkStream and read it into a JObject.

    private static async Task<JObject> ProcessJsonResponse(HttpResponseMessage response)
    {
        // Open the response stream from the network
        using (var s = await ProcessResponseStream(response).ConfigureAwait(false))
        {
            using (var sr = new StreamReader(s))
            {
                using (var reader = new JsonTextReader(sr))
                {
                    var serializer = new JsonSerializer {DateParseHandling = DateParseHandling.None};

                    return serializer.Deserialize<JObject>(reader);
                }
            }
        }
    }

Additionally, you could GZip the stream to reduce the file transfer times. We stream directly to GZipped JSON and back again.
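
For example, as a sketch only (assuming System.IO.Compression and the same write path as the upload sketch above), you can insert a GZipStream between the blob stream and the JSON writer:

    // Sketch only: 'blockBlob' and 'items' are as in the UploadAsJson sketch above.
    // The JSON is compressed and written to the blob as it is produced.
    using (var blobStream = blockBlob.OpenWrite())
    using (var gzipStream = new GZipStream(blobStream, CompressionMode.Compress))
    using (var textWriter = new StreamWriter(gzipStream))
    using (var jsonWriter = new JsonTextWriter(textWriter))
    {
        new JsonSerializer().Serialize(jsonWriter, items);
    }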

Edit: although this example is a Deserialize, the same approach should work for a Serialize.
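
Reading the list back out of blob storage can be done the same streaming way, one array element at a time, so the whole collection never has to be materialized during deserialization. A rough sketch (again assuming Json.NET, the question's BlobContainer, and a top-level JSON array of objects) might be:

    // Sketch only: deserialize each element of the top-level array individually.
    public IEnumerable<T> DownloadFromJson<T>(string blobName)
    {
        CloudBlockBlob blockBlob = BlobContainer.GetBlockBlobReference(blobName);

        using (var blobStream = blockBlob.OpenRead())
        using (var textReader = new StreamReader(blobStream))
        using (var jsonReader = new JsonTextReader(textReader))
        {
            var serializer = new JsonSerializer();

            while (jsonReader.Read())
            {
                if (jsonReader.TokenType == JsonToken.StartObject)
                {
                    yield return serializer.Deserialize<T>(jsonReader);
                }
            }
        }
    }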

answered Nov 14 '22 by James Woodall