
How do you read UTF-8 characters from an infinite byte stream - C#

Tags: c#, stream

Normally, to read characters from a byte stream you use a StreamReader. In this example I'm reading records delimited by '\r' from an infinite stream.

using(var reader = new StreamReader(stream, Encoding.UTF8))
{
    var messageBuilder = new StringBuilder();
    var nextChar = 'x';
    while (reader.Peek() >= 0)
    {
        nextChar = (char)reader.Read();
        messageBuilder.Append(nextChar);

        if (nextChar == '\r')
        {
            ProcessBuffer(messageBuilder.ToString());
            messageBuilder.Clear();
        }
    }
}

The problem is that the StreamReader has a small internal buffer, so if the code is waiting for an 'end of record' delimiter ('\r' in this case), it has to wait until the StreamReader's internal buffer is flushed (usually because more bytes have arrived).

This alternative implementation works for single-byte UTF-8 characters, but fails on multi-byte characters because each byte is decoded on its own.

int byteAsInt = 0;
var messageBuilder = new StringBuilder();
while ((byteAsInt = stream.ReadByte()) != -1)
{
    var nextChar = Encoding.UTF8.GetChars(new[]{(byte) byteAsInt});
    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);

    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}

How can I modify this code so that it works with multi-byte characters?

Mike Hadlow asked Jul 26 '12 14:07



3 Answers

Rather than Encoding.UTF8.GetChars, which is designed to convert complete buffers, get an instance of Decoder and call its GetChars method repeatedly. The Decoder keeps internal state, so a partial multi-byte sequence left at the end of one call is carried over to the next.
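
To illustrate the difference, here is a minimal sketch that feeds the three UTF-8 bytes of a single Japanese character to one Decoder, one byte at a time, and then decodes the first byte on its own with the stateless Encoding.UTF8.GetChars. The class and method names (DecoderDemo, CharsPerByte) are made up for this example.

```csharp
using System;
using System.Text;

static class DecoderDemo
{
    // Feeds the bytes to a single Decoder one at a time and records
    // how many chars each call produced.
    public static int[] CharsPerByte(byte[] bytes)
    {
        var decoder = Encoding.UTF8.GetDecoder();
        var single = new byte[1];
        var output = new char[2]; // room for a surrogate pair
        var counts = new int[bytes.Length];

        for (int i = 0; i < bytes.Length; i++)
        {
            single[0] = bytes[i];
            counts[i] = decoder.GetChars(single, 0, 1, output, 0);
        }
        return counts;
    }

    static void Main()
    {
        // U+3042 (HIRAGANA LETTER A) encodes to the bytes E3 81 82.
        byte[] bytes = Encoding.UTF8.GetBytes("\u3042");

        // The Decoder holds the first two bytes and emits one char on the third.
        Console.WriteLine(string.Join(",", CharsPerByte(bytes))); // prints "0,0,1"

        // The stateless call cannot wait for the rest of the sequence, so the
        // lone lead byte comes back as the replacement character U+FFFD.
        char[] lone = Encoding.UTF8.GetChars(new[] { bytes[0] });
        Console.WriteLine(lone[0] == '\uFFFD');
    }
}
```

A count of 0 simply means "no complete character yet", which is why the reading loop can skip to the next byte in that case.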

Richard answered Oct 06 '22 00:10



Thanks to Richard, I now have a working infinite stream reader. As he explained, the trick is to use a Decoder instance and call its GetChars method. I've tested it with multi-byte Japanese text and it works fine.

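int byteAsInt;
var messageBuilder = new StringBuilder();
var decoder = Encoding.UTF8.GetDecoder();
var nextByte = new byte[1];
var nextChars = new char[2]; // a 4-byte sequence decodes to a surrogate pair (two chars)

while ((byteAsInt = stream.ReadByte()) != -1)
{
    nextByte[0] = (byte) byteAsInt;

    // The decoder keeps partial multi-byte sequences between calls and
    // returns 0 until a complete character is available.
    var charCount = decoder.GetChars(nextByte, 0, 1, nextChars, 0);
    if (charCount == 0) continue;

    Console.Write(nextChars, 0, charCount);
    messageBuilder.Append(nextChars, 0, charCount);

    if (nextChars[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}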
Mike Hadlow answered Oct 06 '22 00:10



I don't understand why you're not using the stream reader's ReadLine method. If there's a good reason not to, however, it nonetheless seems to me that repeatedly calling GetChars on the decoder is inefficient. Why not make use of the fact that the byte representation of '\r' can't be part of a multi-byte sequence? (Bytes in a multi-byte sequence must be greater than 127; that is, they have the highest bit set.)

var messageBuilder = new List<byte>();

int byteAsInt;
while ((byteAsInt = stream.ReadByte()) != -1)
{
    messageBuilder.Add((byte)byteAsInt);

    if (byteAsInt == '\r')
    {
        var messageString = Encoding.UTF8.GetString(messageBuilder.ToArray());
        Console.Write(messageString);
        ProcessBuffer(messageString);
        messageBuilder.Clear();
    }
}
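
The claim that '\r' (byte 0x0D) can never appear inside a multi-byte sequence can be checked with a short sketch like the one below. It encodes one 2-byte, one 3-byte and one 4-byte character and verifies that every resulting byte has its highest bit set. The class and method names (HighBitCheck, AllBytesHigh) are made up for this example.

```csharp
using System;
using System.Linq;
using System.Text;

static class HighBitCheck
{
    // True if every UTF-8 byte of the string is >= 0x80 (highest bit set).
    public static bool AllBytesHigh(string s)
    {
        return Encoding.UTF8.GetBytes(s).All(b => b >= 0x80);
    }

    static void Main()
    {
        // U+00E9 is 2 bytes, U+3042 is 3 bytes, U+1D11E is 4 bytes in UTF-8.
        foreach (var s in new[] { "\u00E9", "\u3042", "\U0001D11E" })
        {
            int byteCount = Encoding.UTF8.GetByteCount(s);
            Console.WriteLine(byteCount + " bytes, all >= 0x80: " + AllBytesHigh(s));
        }

        // An ASCII character such as '\r' is a single byte below 0x80.
        Console.WriteLine(AllBytesHigh("\r")); // prints "False"
    }
}
```

Because 0x0D is below 0x80, a byte equal to '\r' in the stream is always a complete character on its own, so splitting on it before decoding is safe.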
phoog answered Oct 05 '22 23:10
