Parse UTF8 string from ReadOnlySequence<byte>

Tags: c#, .net

How can I parse a UTF-8 string from a ReadOnlySequence<byte>?

A ReadOnlySequence<byte> is made up of segments, and because UTF-8 characters are variable length, a segment boundary can fall in the middle of a character. So simply calling Encoding.UTF8.GetString() on each segment and combining the results in a StringBuilder will not work.
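To make the failure concrete, here is a minimal sketch of what happens when a two-byte character is decoded piece by piece. The class and method names (`SplitCharacterDemo`, `DecodePiecewise`) are hypothetical and exist only for illustration; the sketch assumes the default `Encoding.UTF8` behaviour of substituting U+FFFD for invalid input.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class SplitCharacterDemo
{
    // Decodes the two halves of `bytes` independently and concatenates the
    // results, which is what a naive per-segment GetString loop would do.
    public static string DecodePiecewise(byte[] bytes, int splitAt)
    {
        string head = Encoding.UTF8.GetString(bytes, 0, splitAt);
        string tail = Encoding.UTF8.GetString(bytes, splitAt, bytes.Length - splitAt);
        return head + tail;
    }
}

// Example use:
//   byte[] bytes = Encoding.UTF8.GetBytes("é");   // two bytes: 0xC3 0xA9
//   string s = SplitCharacterDemo.DecodePiecewise(bytes, 1);
//   s == "é" is false: each half is invalid on its own, so the decoder
//   substitutes replacement characters instead of producing "é".
```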

Is it possible to parse a UTF-8 string from a ReadOnlySequence<byte> without first combining the segments into an array? I would prefer to avoid a memory allocation here.

trampster asked Jul 10 '19 22:07


2 Answers

The first thing we should do here is test whether the sequence actually is a single span; if it is, we can hugely simplify and optimize.

Once we know that we have a multi-segment (discontiguous) buffer, there are two ways we can go:

  1. linearize the segments into a contiguous buffer, probably leasing an oversized buffer from ArrayPool<byte>.Shared, and use Encoding.UTF8.GetString on the correct portion of the leased buffer, or
  2. use the GetDecoder() API on the encoding, and use that to populate a new string, which on older frameworks means overwriting a newly allocated string, and on newer frameworks means using the string.Create API.

The first option is much simpler. It involves a few memory-copy operations, but no allocations other than the string itself:

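using System;
using System.Buffers;
using System.Text;

// extension methods must live in a static class; this class name is arbitrary
public static class ReadOnlySequenceExtensions
{
    public static string GetString(in this ReadOnlySequence<byte> payload,
        Encoding encoding = null)
    {
        encoding ??= Encoding.UTF8;
        // fast path: a single segment can be decoded in place, with no copy
        return payload.IsSingleSegment ? encoding.GetString(payload.FirstSpan)
            : GetStringSlow(payload, encoding);

        static string GetStringSlow(in ReadOnlySequence<byte> payload, Encoding encoding)
        {
            // linearize into a leased buffer, which may be larger than requested
            int length = checked((int)payload.Length);
            var oversized = ArrayPool<byte>.Shared.Rent(length);
            try
            {
                payload.CopyTo(oversized);
                // decode only the first `length` bytes; the rest of the buffer is stale
                return encoding.GetString(oversized, 0, length);
            }
            finally
            {
                ArrayPool<byte>.Shared.Return(oversized);
            }
        }
    }
}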
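The second option (the decoder route) is not shown in the answer. A minimal sketch of it might look like the following. It is a variation rather than the exact approach described: instead of writing into the string via string.Create, it decodes into a `char[]` leased from `ArrayPool<char>.Shared` and then builds the string from that, which avoids needing the exact character count up front. The names `SequenceDecoderSketch` and `GetStringViaDecoder` are hypothetical.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper, for illustration only.
public static class SequenceDecoderSketch
{
    public static string GetStringViaDecoder(in ReadOnlySequence<byte> payload,
        Encoding encoding = null)
    {
        encoding ??= Encoding.UTF8;
        if (payload.IsSingleSegment) return encoding.GetString(payload.FirstSpan);

        // A Decoder is stateful: with flush: false it remembers the trailing
        // bytes of an incomplete character and completes it on the next call.
        Decoder decoder = encoding.GetDecoder();

        // Upper bound on the number of chars the whole payload can produce.
        int maxChars = encoding.GetMaxCharCount(checked((int)payload.Length));
        char[] chars = ArrayPool<char>.Shared.Rent(maxChars);
        try
        {
            int written = 0;
            foreach (ReadOnlyMemory<byte> segment in payload)
            {
                written += decoder.GetChars(segment.Span, chars.AsSpan(written), flush: false);
            }
            // Final flush so that a dangling partial character at the very end
            // is handled by the encoding's fallback rather than silently dropped.
            written += decoder.GetChars(ReadOnlySpan<byte>.Empty, chars.AsSpan(written), flush: true);
            return new string(chars, 0, written);
        }
        finally
        {
            ArrayPool<char>.Shared.Return(chars);
        }
    }
}
```

Compared with linearizing, this trades the byte copy for a char copy, so it is unlikely to be a clear win on its own; the point of the sketch is only to show how the decoder carries partial characters across segment boundaries.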
Marc Gravell answered Sep 28 '22 05:09


It seems that .NET 5.0 introduced EncodingExtensions.GetString to solve this problem.

Decodes the specified ReadOnlySequence into a String using the specified Encoding.

using System.Buffers;
using System.Text;

// buffer is the ReadOnlySequence<byte> to decode
string message = EncodingExtensions.GetString(Encoding.UTF8, buffer);
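To check this against the original concern, the sketch below builds a two-segment sequence whose boundary falls inside a multi-byte character and decodes it with the same method, called here with extension-method syntax. The `Chunk` segment class and the `EncodingExtensionsDemo.DecodeSplit` helper are hypothetical names made up for this example; only `ReadOnlySequenceSegment<T>`, `ReadOnlySequence<T>` and `EncodingExtensions.GetString` come from the framework.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical minimal segment type used to link buffers into a sequence.
sealed class Chunk : ReadOnlySequenceSegment<byte>
{
    public Chunk(ReadOnlyMemory<byte> memory) => Memory = memory;

    // Links a new segment after this one and returns it.
    public Chunk Append(ReadOnlyMemory<byte> memory)
    {
        var next = new Chunk(memory) { RunningIndex = RunningIndex + Memory.Length };
        Next = next;
        return next;
    }
}

// Hypothetical helper, for illustration only.
static class EncodingExtensionsDemo
{
    // Encodes `text` as UTF-8, splits the bytes into two segments at
    // `splitAt`, and decodes the resulting multi-segment sequence.
    public static string DecodeSplit(string text, int splitAt)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        var first = new Chunk(utf8.AsMemory(0, splitAt));
        var last = first.Append(utf8.AsMemory(splitAt));
        var sequence = new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);

        // Same call as EncodingExtensions.GetString(Encoding.UTF8, sequence).
        return Encoding.UTF8.GetString(sequence);
    }
}

// Example use: "héllo" encodes to 6 bytes, and splitting at index 2 puts the
// boundary between the two bytes of "é":
//   string s = EncodingExtensionsDemo.DecodeSplit("héllo", 2);
```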
Ryan answered Sep 28 '22 05:09