Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find nearest safe split in byte array containing UTF-8 data

I want to split a large array of UTF-8 encoded data, so that decoding it into chars can be parallelized.

It seems that there's no way to find out how many bytes Encoding.GetCharCount reads. I also can't use GetByteCount(GetChars(...)) since it decodes the entire array anyways, which is what I'm trying to avoid.

like image 538
cmpxchg8b Avatar asked Mar 08 '23 14:03

cmpxchg8b


1 Answers

UTF-8 has well-defined byte sequences and is considered self-synchronizing, meaning given any position in bytes you can find where the character at that position begins at.

The UTF-8 spec (Wikipedia is the easiest link) defines the following byte sequences:

0_______ : ASCII (0-127) char
10______ : Continuation
110_____ : Two-byte character
1110____ : Three-byte character
11110___ : Four-byte character

So, the following method (or something similar) should get your result:

  1. Get the byte count for bytes (bytes.Length, et. al.)
  2. Determine how many sections to split into
  3. Select byte byteCount / sectionCount
  4. Test byte against table:
    1. If byte & 0x80 == 0x00 then you can make this byte part of either section
    2. If byte & 0xE0 == 0xC0 then you need to seek ahead one byte, and keep it with the current section
    3. If byte & 0xF0 == 0xE0 then you need to seek ahead two bytes, and keep it with the current section
    4. If byte & 0xF8 == 0xF0 then you need to seek ahead three bytes, and keep it with the current section
    5. If byte & 0xC0 == 0x80 then you are in a continuation, and should seek ahead until the first byte that does not fit val & 0xB0 == 0x80, then keep up to (but not including) this value in the current section
  5. Select byteStart through byteCount + offset where offset can be defined by the test above
  6. Repeat for each section.

Of course, if we redefine our test as returning the current char start position, we have two cases: 1. If (byte[i] & 0xC0) == 0x80 then we need to move around the array 2. Else, return the current i (since it's not a continuation)

This gives us the following method:

public static int GetCharStart(ref byte[] arr, int index) =>
    (arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;

Next, we want to get each section. The easiest way is to use a state-machine (or abuse, depending on how you look at it) to return the sections:

public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
    var sectionStart = 0;
    var sectionEnd = 0;

    for (var i = 0; i < sectionCount; i++)
    {
        sectionEnd = i == (sectionCount - 1) ? utf8Array.Length : GetCharStart(ref utf8Array, (int)Math.Round((double)utf8Array.Length / sectionCount * i));
        yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
        sectionStart = sectionEnd;
    }
}

Now I built this in this manner because I want to use Parallel.ForEach to demonstrate the result, which makes it super easy if we have an IEnumerable, and it also allows me to be extremely lazy with the processing: we only continue to gather sections when needed, which means we can lazily process it and do it on-demand, which is a good thing, no?

Lastly, we need to be able to get a section of bytes, so we have the GetSection method:

public static byte[] GetSection(ref byte[] array, int start, int end)
{
    var result = new byte[end - start];
    for (var i = 0; i < result.Length; i++)
    {
        result[i] = array[i + start];
    }
    return result;
}

Finally, the demonstration:

var sourceText = "Some test 平仮名, ひらがな string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.";
var source = Encoding.UTF8.GetBytes(sourceText);
Console.WriteLine(sourceText);

var results = new ConcurrentBag<string>();
Parallel.ForEach(GetByteSections(source, 10),
                    new ParallelOptions { MaxDegreeOfParallelism = 1 },
                    x => { Console.WriteLine(Encoding.UTF8.GetString(x)); results.Add(Encoding.UTF8.GetString(x)); });

Console.WriteLine();
Console.WriteLine("Assemble the result: ");
Console.WriteLine(string.Join("", results.Reverse()));
Console.ReadLine();

The result:

Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.

Some test ???, ??
?? string that should b
e decoded in parallel, thi
s demonstrates that we work
 flawlessly with Parallel.
ForEach. The only downside
to using `Parallel.ForEach`
 the way I demonstrate is
that it doesn't take order into account, but oh-well.

Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.

Not perfect, but it does the job. If we change MaxDegreesOfParallelism to a higher value, our string gets jumbled:

Some test ???, ??
e decoded in parallel, thi
 flawlessly with Parallel.
?? string that should b
to using `Parallel.ForEach`
ForEach. The only downside
that it doesn't take order into account, but oh-well.
s demonstrates that we work
 the way I demonstrate is

So, as you can see, super easy. You'll want to make modifications to allow for correct order-reassembly, but this should demonstrate the trick.

If we modify the GetByteSections method as follows, the last section is no longer ~2x the size of the remaining ones:

public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
    var sectionStart = 0;
    var sectionEnd = 0;
    var sectionSize = (int)Math.Ceiling((double)utf8Array.Length / sectionCount);

    for (var i = 0; i < sectionCount; i++)
    {
        if (i == (sectionCount - 1))
        {
            var lengthRem = utf8Array.Length - i * sectionSize;
            sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
            sectionStart = sectionEnd;
            sectionEnd = utf8Array.Length;
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
        }
        else
        {
            sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
            sectionStart = sectionEnd;
        }
    }
}

The result:

Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.

Some test ???, ???? string that should be de
coded in parallel, this demonstrates that we work flawless
ly with Parallel.ForEach. The only downside to using `Para
llel.ForEach` the way I demonstrate is that it doesn't tak
e order into account, but oh-well. We can continue to incr
ease the length of this string to demonstrate that the las
t section is usually about double the size of the other se
ctions, we could fix that if we really wanted to. In fact,
 with a small modification it does so, we just have to rem
ember that we'll end up with `sectionCount + 1` results.

Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.

And finally, if for some reason you split into an abnormally large number of sections compared to input size (my input size of ~578 bytes at 250 chars demonstrates this) you'll hit an IndexOutOfRangeException in GetCharStart, the following version fixes that:

public static int GetCharStart(ref byte[] arr, int index)
{
    if (index > arr.Length)
    {
        index = arr.Length - 1;
    }

    return (arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
}

Of course this leaves you with a bunch of empty results, but when you reassemble the string doesn't change, so I'm not even going to bother posting the full scenario test here. (I leave it up to you to experiment.)

like image 174
Der Kommissar Avatar answered Mar 12 '23 00:03

Der Kommissar