Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are IBinarySerialize Interface methods used for?

When you create a custom aggregate function you need to specified the enumeration format:

Format Enumeration is used by SqlUserDefinedTypeAttribute and SqlUserDefinedAggregateAttribute to indicate the serialization format of a user-defined type (UDT) or aggregate.

and when UserDefined format is used, your class need to implement IBinarySerialize Interface and override its read and write methods.

My question is what exactly these methods need to do?

Looking at the examples, I guess they should be able to read/write the aggregation result?

For example, I am trying to create a SQL CLR function that concatenates distinct numbers. In the T-SQL I can have from 1 to 255 distinct numbers (TINYINT value). I need to create a string from them (using delimiter), but sorting the numbers as well. The function seems to work, but I am not exactly sure I have override the methods as expected:

[Serializable]
[
    Microsoft.SqlServer.Server.SqlUserDefinedAggregate
    (
        Microsoft.SqlServer.Server.Format.UserDefined,
        IsInvariantToNulls = true,
        IsInvariantToDuplicates = true,
        IsInvariantToOrder = false,
        MaxByteSize = 1024
    )
]
public class ConcatenateAnswersPos : Microsoft.SqlServer.Server.IBinarySerialize
{
    private List<byte> intermediateResult;

    public void Init()
    {
        intermediateResult = new List<byte>();
    }

    public void Accumulate(SqlByte value)
    {
        intermediateResult.Add((byte)value);
    }

    public void Merge(ConcatenateAnswersPos other)
    {
        intermediateResult.AddRange(other.intermediateResult);
    }

    public SqlString Terminate()
    {
        if (intermediateResult != null)
        {
            intermediateResult.Sort();
            return new SqlString(string.Join(";", intermediateResult));
        }
        else
        {
            return new SqlString("");
        }

    }

    public void Read(BinaryReader r)
    {
        if (r == null) throw new ArgumentNullException("r");

        intermediateResult = new List<byte>();
        string[] answers = r.ReadString().Split(';');

        foreach (string answer in answers)
        {
            intermediateResult.Add(Convert.ToByte(answer));
        }
    }

    public void Write(BinaryWriter w)
    {
        if (w == null) throw new ArgumentNullException("w");
        intermediateResult.Sort();
        w.Write(string.Join(";", intermediateResult));
    }
}
like image 544
gotqn Avatar asked Jan 05 '15 14:01

gotqn


2 Answers

Any particular instance of the User-Defined Aggregate (UDA) is not guaranteed to exist throughout the entire lifetime of the query. It needs to have a storable representation. When you only use value types (as noted in the "enumeration format" link in the question), the provided Read and Write methods understand how to serialize and deserialize the UDA, in which case you would use Format.Native. But when you start using reference types (String, collections, custom types, etc) then it is up to you to define how those values get serialized and deserialized, in which case you need to use Format.UserDefined and override the Read and Write methods so you can be in control of those operations.

The values that need to be serialized are anything that is needed to bring the new instance of the UDA back to the exact state that it was in prior to it getting serialized. This means: do not rely upon the Init() method running (it runs once per group!) or variable initializers (they run once per instantiation, and the UDA can be re-used for multiple groups without re-creating!). So you need to serialize all base values, even if they are not directly related to the final output.


That said, you should, at the very least, do the optimizations noted in @Damien_The_Unbeliever's answer:

  • Don't do the sort in the Write method. You are already doing it in the Terminate method (the appropriate place), so it is useless to do twice, not to mention very inefficient.

  • Store the count of the collection and then the individual elements

Beyond that:

  • When you say that your UDA "concatenates distinct numbers" if you really meant "distinct" then you need to check each number as it comes in to see if it is in the list already. I suspect this is your desire since you have IsInvariantToDuplicates set to true. You would do this both in the Accumulate method:

    if (!intermediateResult.Contains(value.Value))
    {
      intermediateResult.Add(value.Value);
    }
    

    and in the Merge method (called when parallelism is involved):

    foreach (byte _NewValue in other.intermediateResult)
    {
      if (!intermediateResult.Contains(_NewValue))
      {
        intermediateResult.Add(_NewValue);
      }
    }    
    

    Please note that I changed your cast -- (byte)value -- in the Accumulate method into using the Value property. All of the SqlTypes (e.g. SqlByte, SqlString, SqlInt32, etc.) have a Value property that returns the .NET type that you would expect. This means that there is no need to call ToString() on a SqlString as many people seem to do.

  • I would be cautious with a MaxByteSize of 1024. This concern would be partially mitigated by implementing @Damien's suggestions given that saving "165;207" in a string (current method) is technically 14 bytes (7 characters * 2 bytes per char) whereas saving the count and individual bytes is only 6 bytes (4 bytes for the Int32 to store the count + 2 individual bytes). And this disparity is just for storing 2 values. Storing 200? Yeesh!

  • You do not have the IsNullIfEmpty property specified. You need to specify this, especially considering that your Terminate method is returning an empty string if the internal collection is null. You should add IsNullIfEmpty = false since you don't want to return NULL if this is never called.

  • The extra logic in the Terminate method to handle a null collection is probably not necessary. The collection is initialized in the Init and Read methods, so I am not sure how it could ever be null by the time Terminate is called.


If you want an example of creating a User-Defined Aggregate with Format.UserDefined, then take a look at Getting The Most Out of SQL Server 2005 UDTs and UDAs (free registration required). I wrote that prior to SQL Server 2008 coming out which allowed for more than 8000 bytes to be serialized, so you can ignore (for the moment) the aspects regarding compressing the data to serialize.

Also, if you want to learn more about SQLCLR in general, I am writing a series on it for SQL Server Central: Stairway to SQLCLR (same site as the first linked article).

like image 183
Solomon Rutzky Avatar answered Nov 15 '22 16:11

Solomon Rutzky


I would say that you're doing more work in your methods than you need to. All you need to do is write enough in the Write method so that your Read method can reconstruct your internal state. Since your internal state is just a List<byte>, there no need to treat everything as strings:

public void Read(BinaryReader r)
{
    if (r == null) throw new ArgumentNullException("r");

    var count= r.ReadInt32();

    intermediateResult = new List<byte>(count);
    for (int i=0;i<count;i++)
    {
        intermediateResult.Add(r.ReadByte());
    }
}

public void Write(BinaryWriter w)
{
    if (w == null) throw new ArgumentNullException("w");
    w.Write(intermediateResult.Count);
    foreach(byte b in intermediateResult)
    {
      w.Write(b);
    }
}

And as I suggested in the comments, I've also removed the Sort from the Write method since there'll always be a final Sort call in Terminate before the data you've constructed is passed to consumers of your aggregate.


We store the Count first in the data so that we know how many times to call ReadByte in the Read method. This also allows a (probably pointless) optimization that we can tell the List<byte> constructor exactly how much items it needs space for.

like image 26
Damien_The_Unbeliever Avatar answered Nov 15 '22 18:11

Damien_The_Unbeliever