I want to split a large array of UTF-8 encoded data, so that decoding it into chars can be parallelized.
It seems that there's no way to find out how many bytes Encoding.GetCharCount
reads. I also can't use GetByteCount(GetChars(...))
since it decodes the entire array anyways, which is what I'm trying to avoid.
UTF-8 has well-defined byte sequences and is considered self-synchronizing, meaning given any position in bytes
you can find where the character at that position begins at.
The UTF-8 spec (Wikipedia is the easiest link) defines the following byte sequences:
0_______ : ASCII (0-127) char
10______ : Continuation
110_____ : Two-byte character
1110____ : Three-byte character
11110___ : Four-byte character
So, the following method (or something similar) should get your result:
bytes
(bytes.Length
, et. al.)byteCount / sectionCount
byte & 0x80 == 0x00
then you can make this byte part of either sectionbyte & 0xE0 == 0xC0
then you need to seek ahead one byte, and keep it with the current sectionbyte & 0xF0 == 0xE0
then you need to seek ahead two bytes, and keep it with the current sectionbyte & 0xF8 == 0xF0
then you need to seek ahead three bytes, and keep it with the current sectionbyte & 0xC0 == 0x80
then you are in a continuation, and should seek ahead until the first byte that does not fit val & 0xB0 == 0x80
, then keep up to (but not including) this value in the current sectionbyteStart
through byteCount + offset
where offset
can be defined by the test aboveOf course, if we redefine our test as returning the current char start position, we have two cases:
1. If (byte[i] & 0xC0) == 0x80
then we need to move around the array
2. Else, return the current i
(since it's not a continuation)
This gives us the following method:
public static int GetCharStart(ref byte[] arr, int index) =>
(arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
Next, we want to get each section. The easiest way is to use a state-machine (or abuse, depending on how you look at it) to return the sections:
public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
var sectionStart = 0;
var sectionEnd = 0;
for (var i = 0; i < sectionCount; i++)
{
sectionEnd = i == (sectionCount - 1) ? utf8Array.Length : GetCharStart(ref utf8Array, (int)Math.Round((double)utf8Array.Length / sectionCount * i));
yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
sectionStart = sectionEnd;
}
}
Now I built this in this manner because I want to use Parallel.ForEach
to demonstrate the result, which makes it super easy if we have an IEnumerable
, and it also allows me to be extremely lazy with the processing: we only continue to gather sections when needed, which means we can lazily process it and do it on-demand, which is a good thing, no?
Lastly, we need to be able to get a section of bytes, so we have the GetSection
method:
public static byte[] GetSection(ref byte[] array, int start, int end)
{
var result = new byte[end - start];
for (var i = 0; i < result.Length; i++)
{
result[i] = array[i + start];
}
return result;
}
Finally, the demonstration:
var sourceText = "Some test 平仮名, ひらがな string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.";
var source = Encoding.UTF8.GetBytes(sourceText);
Console.WriteLine(sourceText);
var results = new ConcurrentBag<string>();
Parallel.ForEach(GetByteSections(source, 10),
new ParallelOptions { MaxDegreeOfParallelism = 1 },
x => { Console.WriteLine(Encoding.UTF8.GetString(x)); results.Add(Encoding.UTF8.GetString(x)); });
Console.WriteLine();
Console.WriteLine("Assemble the result: ");
Console.WriteLine(string.Join("", results.Reverse()));
Console.ReadLine();
The result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. Some test ???, ?? ?? string that should b e decoded in parallel, thi s demonstrates that we work flawlessly with Parallel. ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. Assemble the result: Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.
Not perfect, but it does the job. If we change MaxDegreesOfParallelism
to a higher value, our string gets jumbled:
Some test ???, ?? e decoded in parallel, thi flawlessly with Parallel. ?? string that should b to using `Parallel.ForEach` ForEach. The only downside that it doesn't take order into account, but oh-well. s demonstrates that we work the way I demonstrate is
So, as you can see, super easy. You'll want to make modifications to allow for correct order-reassembly, but this should demonstrate the trick.
If we modify the GetByteSections
method as follows, the last section is no longer ~2x the size of the remaining ones:
public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
var sectionStart = 0;
var sectionEnd = 0;
var sectionSize = (int)Math.Ceiling((double)utf8Array.Length / sectionCount);
for (var i = 0; i < sectionCount; i++)
{
if (i == (sectionCount - 1))
{
var lengthRem = utf8Array.Length - i * sectionSize;
sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
sectionStart = sectionEnd;
sectionEnd = utf8Array.Length;
yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
}
else
{
sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
sectionStart = sectionEnd;
}
}
}
The result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results. Some test ???, ???? string that should be de coded in parallel, this demonstrates that we work flawless ly with Parallel.ForEach. The only downside to using `Para llel.ForEach` the way I demonstrate is that it doesn't tak e order into account, but oh-well. We can continue to incr ease the length of this string to demonstrate that the las t section is usually about double the size of the other se ctions, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to rem ember that we'll end up with `sectionCount + 1` results. Assemble the result: Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.
And finally, if for some reason you split into an abnormally large number of sections compared to input size (my input size of ~578 bytes at 250 chars demonstrates this) you'll hit an IndexOutOfRangeException
in GetCharStart
, the following version fixes that:
public static int GetCharStart(ref byte[] arr, int index)
{
if (index > arr.Length)
{
index = arr.Length - 1;
}
return (arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
}
Of course this leaves you with a bunch of empty results, but when you reassemble the string doesn't change, so I'm not even going to bother posting the full scenario test here. (I leave it up to you to experiment.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With