Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to I serialize a large graph of .NET object into a SQL Server BLOB without creating a large buffer?

We have code like:

ms = New IO.MemoryStream
bin = New System.Runtime.Serialization.Formatters.Binary.BinaryFormatter
bin.Serialize(ms, largeGraphOfObjects)
dataToSaveToDatabase = ms.ToArray()
// put dataToSaveToDatabase in a Sql server BLOB

But the memory steam allocates a large buffer from the large memory heap that is giving us problems. So how can we stream the data without needing enough free memory to hold the serialized objects.

I am looking for a way to get a Stream from SQL server that can then be passed to bin.Serialize() so avoiding keeping all the data in my processes memory.

Likewise for reading the data back...


Some more background.

This is part of a complex numerical processing system that processes data in near real time looking for equipment problems etc, the serialization is done to allow a restart when there is a problem with data quality from a data feed etc. (We store the data feeds and can rerun them after the operator has edited out bad values.)

Therefore we serialize the object a lot more often then we de-serialize them.

The objects we are serializing include very large arrays mostly of doubles as well as a lot of small “more normal” objects. We are pushing the memory limit on 32 bit systems and make the garbage collector work very hard. (Effects are being made elsewhere in the system to improve this, e.g. reusing large arrays rather then create new arrays.)

Often the serialization of the state is the last straw that causes an out of memory exception; the peak of our memory usage is always during this serialization step.

I think we get large memory pool fragmentation when we de-serialize the object, I expect there are also other problems with large memory pool fragmentation given the size of the arrays. (This has not yet been investigated, as the person that first looked at this is a numerical processing expert, not a memory management expert.)

Our customers use a mix of SQL Server 2000, 2005 and 2008 and we would rather not have different code paths for each version of SQL Server if possible.

We can have many active models at a time (in different processes, across many machines), each model can have many saved states. Hence the saved state is stored in a database blob rather then a file.

As the spread of saving the state is important, I would rather not serialize the object to a file, and then put the file in a BLOB one block at a time.

Other related questions I have asked

  • How to Stream data from/to SQL Server BLOB fields?
  • Is there a SqlFileStream like class that works with Sql Server 2005?
like image 913
Ian Ringrose Avatar asked Jan 20 '10 12:01

Ian Ringrose


3 Answers

There is no built-in ADO.Net functionality to handle this really gracefully for large data. The problem is two fold:

  • there is no API to 'write' into a SQL command(s) or parameters as into a stream. The parameter types that accept a stream (like FileStream) accept the stream to READ from it, which does not agree with the serialization semantics of write into a stream. No matter which way you turn this, you end up with a in memory copy of the entire serialized object, bad.
  • even if the point above would be solved (and it cannot be), the TDS protocol and the way SQL Server accepts parameters do not work well with large parameters as the entire request has to be first received before it is launched into execution and this would create additional copies of the object inside SQL Server.

So you really have to approach this from a different angle. Fortunately, there is a fairly easy solution. The trick is to use the highly efficient UPDATE .WRITE syntax and pass in the chunks of data one by one, in a series of T-SQL statements. This is the MSDN recommended way, see Modifying Large-Value (max) Data in ADO.NET. This looks complicated, but is actually trivial to do and plug into a Stream class.


The BlobStream class

This is the bread and butter of the solution. A Stream derived class that implements the Write method as a call to the T-SQL BLOB WRITE syntax. Straight forward, the only thing interesting about it is that it has to keep track of the first update because the UPDATE ... SET blob.WRITE(...) syntax would fail on a NULL field:

class BlobStream: Stream
{
    private SqlCommand cmdAppendChunk;
    private SqlCommand cmdFirstChunk;
    private SqlConnection connection;
    private SqlTransaction transaction;

    private SqlParameter paramChunk;
    private SqlParameter paramLength;

    private long offset;

    public BlobStream(
        SqlConnection connection,
        SqlTransaction transaction,
        string schemaName,
        string tableName,
        string blobColumn,
        string keyColumn,
        object keyValue)
    {
        this.transaction = transaction;
        this.connection = connection;
        cmdFirstChunk = new SqlCommand(String.Format(@"
UPDATE [{0}].[{1}]
    SET [{2}] = @firstChunk
    WHERE [{3}] = @key"
            ,schemaName, tableName, blobColumn, keyColumn)
            , connection, transaction);
        cmdFirstChunk.Parameters.AddWithValue("@key", keyValue);
        cmdAppendChunk = new SqlCommand(String.Format(@"
UPDATE [{0}].[{1}]
    SET [{2}].WRITE(@chunk, NULL, NULL)
    WHERE [{3}] = @key"
            , schemaName, tableName, blobColumn, keyColumn)
            , connection, transaction);
        cmdAppendChunk.Parameters.AddWithValue("@key", keyValue);
        paramChunk = new SqlParameter("@chunk", SqlDbType.VarBinary, -1);
        cmdAppendChunk.Parameters.Add(paramChunk);
    }

    public override void Write(byte[] buffer, int index, int count)
    {
        byte[] bytesToWrite = buffer;
        if (index != 0 || count != buffer.Length)
        {
            bytesToWrite = new MemoryStream(buffer, index, count).ToArray();
        }
        if (offset == 0)
        {
            cmdFirstChunk.Parameters.AddWithValue("@firstChunk", bytesToWrite);
            cmdFirstChunk.ExecuteNonQuery();
            offset = count;
        }
        else
        {
            paramChunk.Value = bytesToWrite;
            cmdAppendChunk.ExecuteNonQuery();
            offset += count;
        }
    }

    // Rest of the abstract Stream implementation
 }

Using the BlobStream

To use this newly created blob stream class you plug into a BufferedStream. The class has a trivial design that handles only writing the stream into a column of a table. I'll reuse a table from another example:

CREATE TABLE [dbo].[Uploads](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [FileName] [varchar](256) NULL,
    [ContentType] [varchar](256) NULL,
    [FileData] [varbinary](max) NULL)

I'll add a dummy object to be serialized:

[Serializable]
class HugeSerialized
{
    public byte[] theBigArray { get; set; }
}

Finally, the actual serialization. We'll first insert a new record into the Uploads table, then create a BlobStream on the newly inserted Id and call the serialization straight into this stream:

using (SqlConnection conn = new SqlConnection(Settings.Default.connString))
{
    conn.Open();
    using (SqlTransaction trn = conn.BeginTransaction())
    {
        SqlCommand cmdInsert = new SqlCommand(
@"INSERT INTO dbo.Uploads (FileName, ContentType)
VALUES (@fileName, @contentType);
SET @id = SCOPE_IDENTITY();", conn, trn);
        cmdInsert.Parameters.AddWithValue("@fileName", "Demo");
        cmdInsert.Parameters.AddWithValue("@contentType", "application/octet-stream");
        SqlParameter paramId = new SqlParameter("@id", SqlDbType.Int);
        paramId.Direction = ParameterDirection.Output;
        cmdInsert.Parameters.Add(paramId);
        cmdInsert.ExecuteNonQuery();

        BlobStream blob = new BlobStream(
            conn, trn, "dbo", "Uploads", "FileData", "Id", paramId.Value);
        BufferedStream bufferedBlob = new BufferedStream(blob, 8040);

        HugeSerialized big = new HugeSerialized { theBigArray = new byte[1024 * 1024] };
        BinaryFormatter bf = new BinaryFormatter();
        bf.Serialize(bufferedBlob, big);

        trn.Commit();
    }
}

If you monitor the execution of this simple sample you'll see that nowhere is a large serialization stream created. The sample will allocate the array of [1024*1024] but that is for demo purposes to have something to serialize. This code serializes in a buffered manner, chunk by chunk, using the SQL Server BLOB recommended update size of 8040 bytes at a time.

like image 98
Remus Rusanu Avatar answered Nov 08 '22 23:11

Remus Rusanu


All you need is .NET Framework 4.5 and streaming. Let's assume we have a big file on HDD and we want to upload this file.

SQL code:

CREATE TABLE BigFiles 
(
    [BigDataID] [int] IDENTITY(1,1) NOT NULL,
    [Data] VARBINARY(MAX) NULL
)

C# code:

using (FileStream sourceStream = new FileStream(filePath, FileMode.Open))
{
    using (SqlCommand cmd = new SqlCommand(string.Format("UPDATE BigFiles SET Data=@Data WHERE BigDataID = @BigDataID"), _sqlConn))
    {
        cmd.Parameters.AddWithValue("@Data", sourceStream);
        cmd.Parameters.AddWithValue("@BigDataID", entryId);

        cmd.ExecuteNonQuery();
    }
}

Works good for me. I have successfully uploaded the file of 400 mb, while MemoryStream throwed an exception when I tried to load this file into memory.

UPD: This code works on Windows 7, but failed on Windows XP and 2003 Server.

like image 37
vladimir khozeyev Avatar answered Nov 08 '22 23:11

vladimir khozeyev


You can always write to SQL Server at a lower level using the over the wire protocol TDS (tabular data stream) that Microsoft has used since day one. They are unlikely to change it any time soon as even SQLAzure uses it!

You can see source code of how this works from the Mono project and from the freetds project

Check out the tds_blob

  • http://www.mono-project.com/TDS_Generic

  • http://www.mono-project.com/SQLClient

  • http://www.freetds.org/

like image 4
TFD Avatar answered Nov 09 '22 00:11

TFD