How do I read characters in a string as their UTF-32 decimal values?

For example, I have this Unicode string, consisting of the Cyclone and Japanese Castle characters, defined in C#. .NET uses UTF-16 for its CLR string encoding:

var value = "🌀🏯";

If you check this, you quickly find that value.Length == 4, because C# uses UTF-16 encoded strings and each of these characters takes two UTF-16 code units. For this reason, I can't just loop over each char and get its UTF-32 decimal value: foreach (var character in value) result = (ulong)character;. This raises the question: how can I get the UTF-32 decimal value for each character in any string?

Cyclone should be 127744 and Japanese Castle should be 127983, but I am looking for a general answer that takes any C# string and produces the UTF-32 decimal value of each character in it.

I've looked at Char.ConvertToUtf32, but it seems problematic with a string such as:

var value = "a🌀c🏯";

This has a length of 6. So, how do I know when a new character begins? For example:

Char.ConvertToUtf32(value, 0)   97  int
Char.ConvertToUtf32(value, 1)   127744  int
Char.ConvertToUtf32(value, 2)   'Char.ConvertToUtf32(value, 2)' threw an exception of type 'System.ArgumentException'   int {System.ArgumentException}
Char.ConvertToUtf32(value, 3)   99  int
Char.ConvertToUtf32(value, 4)   127983  int
Char.ConvertToUtf32(value, 5)   'Char.ConvertToUtf32(value, 5)' threw an exception of type 'System.ArgumentException'   int {System.ArgumentException}

There is also this overload:

public static int ConvertToUtf32(
    char highSurrogate,
    char lowSurrogate
)

But to use this overload, I need to figure out where the surrogate pairs are. How can I do that?

Alexandru asked Aug 21 '15 13:08



1 Answer

Solution 1

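// Requires: using System; using System.Text;
string value = "🌀🏯";
// Encoding.UTF32 is little-endian and emits 4 bytes per code point.
byte[] rawUtf32AsBytes = Encoding.UTF32.GetBytes(value);
int[] rawUtf32 = new int[rawUtf32AsBytes.Length / 4];
// Reinterpret the bytes as ints (assumes a little-endian platform).
Buffer.BlockCopy(rawUtf32AsBytes, 0, rawUtf32, 0, rawUtf32AsBytes.Length);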

Solution 2

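// Requires: using System; using System.Collections.Generic;
string value = "🌀🏯";
List<int> rawUtf32list = new List<int>();
for (int i = 0; i < value.Length; i++)
{
    // A valid pair is a high surrogate immediately followed by a low surrogate.
    // The bounds check avoids reading past the end on a trailing high surrogate.
    if (Char.IsHighSurrogate(value[i]) && i + 1 < value.Length && Char.IsLowSurrogate(value[i + 1]))
    {
        rawUtf32list.Add(Char.ConvertToUtf32(value[i], value[i + 1]));
        i++; // skip the low surrogate that was just consumed
    }
    else
        rawUtf32list.Add((int)value[i]); // BMP character, or an unpaired surrogate kept as-is
}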
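
As a sketch of the same idea packaged as a reusable method, the loop can also be written with the index-based overloads Char.IsSurrogatePair(string, int) and Char.ConvertToUtf32(string, int). The class and method names (CodePoints, FromString) are hypothetical, and passing an unpaired surrogate through as its raw UTF-16 value is an assumption about the desired behavior, not something the question specifies:

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Hypothetical helper: returns the UTF-32 code point of each character in s.
    public static List<int> FromString(string s)
    {
        var result = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            if (Char.IsSurrogatePair(s, i))
            {
                // Valid high+low pair: decode both UTF-16 units into one code point.
                result.Add(Char.ConvertToUtf32(s, i));
                i++; // skip the low surrogate
            }
            else
            {
                // BMP character, or an unpaired surrogate passed through unchanged.
                result.Add(s[i]);
            }
        }
        return result;
    }
}
```

With this sketch, CodePoints.FromString("a🌀c🏯") should yield 97, 127744, 99, 127983, which matches the values the question expects.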

Update:

Starting with .NET Core 3.0, there is the Rune struct, which represents a Unicode scalar value (a single UTF-32 code point outside the surrogate range):

// Requires: using System; using System.Linq;
string value = "a🌀c🏯";
var runes = value.EnumerateRunes();

// writes a:97, 🌀:127744, c:99, 🏯:127983
Console.WriteLine(String.Join(", ", runes.Select(r => $"{r}:{r.Value}")));
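
If only the numeric values are needed, the rune enumeration can be projected straight into an array. This is a minimal sketch assuming .NET Core 3.0 or later; the names RuneCodePoints and FromString are hypothetical. Note that EnumerateRunes represents an invalid sequence, such as an unpaired surrogate, as the replacement character U+FFFD (65533) rather than throwing:

```csharp
using System.Linq;

public static class RuneCodePoints
{
    // Hypothetical helper: one int per Unicode scalar value in s.
    // Unpaired surrogates come back as U+FFFD (65533).
    public static int[] FromString(string s) =>
        s.EnumerateRunes().Select(r => r.Value).ToArray();
}
```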
György Kőszeg answered Oct 16 '22 21:10
