For example, I have this Unicode string consisting of the Cyclone (🌀) and Japanese Castle (🏯) characters, defined in C# on .NET, which uses UTF-16 for its CLR string encoding:
var value = "🌀🏯";
If you check this, you quickly find that value.Length is 4, because each of these characters is stored in UTF-16 as a surrogate pair (two char values). So I can't just loop over each char and get its UTF-32 decimal value with foreach (var character in value) result = (ulong)character;. This raises the question: how can I get the UTF-32 decimal value for each character in any string?
Cyclone should be 127744 and Japanese Castle should be 127983, but I am looking for a general answer that takes any C# string and always produces a UTF-32 decimal value for each character in it.
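To make the problem concrete, here is a minimal sketch that prints the raw UTF-16 code units of the string. It is written as a top-level program, which assumes C# 9 or later, and it uses \U escapes instead of literal emoji so it does not depend on the source file encoding.

```csharp
using System;
using System.Linq;

string value = "\U0001F300\U0001F3EF"; // Cyclone, Japanese Castle

// Both characters lie outside the Basic Multilingual Plane, so UTF-16 stores
// each one as two char values: a high surrogate followed by a low surrogate.
int[] codeUnits = value.Select(c => (int)c).ToArray();

Console.WriteLine(value.Length);                 // 4
Console.WriteLine(string.Join(", ", codeUnits)); // 55356, 57088, 55356, 57327
```

None of the four values is 127744 or 127983, which is why casting each char individually cannot work.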
I've also looked at Char.ConvertToUtf32, but it seems problematic for a string like this:
var value = "a🌀c🏯";
This has a length of 6. So, how do I know where a new character begins? For example:
Name                             Value                             Type
Char.ConvertToUtf32(value, 0)    97                                int
Char.ConvertToUtf32(value, 1)    127744                            int
Char.ConvertToUtf32(value, 2)    threw System.ArgumentException    int {System.ArgumentException}
Char.ConvertToUtf32(value, 3)    99                                int
Char.ConvertToUtf32(value, 4)    127983                            int
Char.ConvertToUtf32(value, 5)    threw System.ArgumentException    int {System.ArgumentException}
There is also this overload:
public static int ConvertToUtf32(
    char highSurrogate,
    char lowSurrogate
)
But to use it I still need to figure out where the surrogate pairs are. How can I do that?
Solution 1
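using System;
using System.Text;
string value = "🌀🏯";
// Encoding.UTF32 emits little-endian UTF-32: four bytes per code point, no byte order mark.
byte[] rawUtf32AsBytes = Encoding.UTF32.GetBytes(value);
int[] rawUtf32 = new int[rawUtf32AsBytes.Length / 4];
// Reinterpret the bytes as ints; this is a raw copy, so it assumes a little-endian machine.
Buffer.BlockCopy(rawUtf32AsBytes, 0, rawUtf32, 0, rawUtf32AsBytes.Length);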
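Buffer.BlockCopy copies raw bytes, so Solution 1 gives the right integers only where the machine's byte order matches the little-endian output of Encoding.UTF32. A possible variation that decodes each group of four bytes explicitly is sketched below. The helper name ToCodePoints is made up for illustration, and BinaryPrimitives assumes .NET Core 2.1 or later.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// Hypothetical helper: returns one UTF-32 value per character of the input.
static int[] ToCodePoints(string text)
{
    byte[] bytes = Encoding.UTF32.GetBytes(text);
    int[] codePoints = new int[bytes.Length / 4];
    for (int i = 0; i < codePoints.Length; i++)
    {
        // Read the bytes as little-endian regardless of the CPU's own byte order.
        codePoints[i] = BinaryPrimitives.ReadInt32LittleEndian(bytes.AsSpan(i * 4, 4));
    }
    return codePoints;
}

int[] result = ToCodePoints("a\U0001F300c\U0001F3EF");
Console.WriteLine(string.Join(", ", result)); // 97, 127744, 99, 127983
```

On the common little-endian platforms this should produce the same array as the BlockCopy version, so the extra loop only matters if portability is a concern.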
Solution 2
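using System;
using System.Collections.Generic;
string value = "🌀🏯";
List<int> rawUtf32list = new List<int>();
for (int i = 0; i < value.Length; i++)
{
    // A valid pair is a high surrogate immediately followed by a low surrogate.
    // Checking both also avoids reading past the end of the string.
    if (Char.IsSurrogatePair(value, i))
    {
        rawUtf32list.Add(Char.ConvertToUtf32(value[i], value[i + 1]));
        i++; // skip the low surrogate that was just consumed
    }
    else
        rawUtf32list.Add((int)value[i]); // BMP character (or an unpaired surrogate)
}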
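The same idea also answers the original question of how to step through Char.ConvertToUtf32(value, index) without hitting the ArgumentException: advance by two code units whenever the current index starts a surrogate pair. The sketch below wraps this in an iterator. The name CodePoints is hypothetical, and substituting U+FFFD for an unpaired surrogate is a choice made for this example, not something the answer prescribes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: lazily yields one UTF-32 value per character.
static IEnumerable<int> CodePoints(string text)
{
    for (int i = 0; i < text.Length; i++)
    {
        if (char.IsSurrogatePair(text, i))
        {
            yield return char.ConvertToUtf32(text, i);
            i++; // the pair occupies two code units
        }
        else if (char.IsSurrogate(text[i]))
        {
            yield return 0xFFFD; // unpaired surrogate: emit the replacement character
        }
        else
        {
            yield return text[i]; // single code unit in the Basic Multilingual Plane
        }
    }
}

var points = new List<int>(CodePoints("a\U0001F300c\U0001F3EF"));
Console.WriteLine(string.Join(", ", points)); // 97, 127744, 99, 127983
```

If malformed input should be treated as an error instead, the middle branch could throw.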
Update:
Starting with .NET Core 3.0, there is the Rune struct, which represents a Unicode scalar value (the same number as its UTF-32 encoding):
using System;
using System.Linq;
string value = "a🌀c🏯";
var runes = value.EnumerateRunes();
// writes a:97, 🌀:127744, c:99, 🏯:127983
Console.WriteLine(String.Join(", ", runes.Select(r => $"{r}:{r.Value}")));
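Rune can also report where each character begins in the UTF-16 string, which was the sticking point with the index-based Char.ConvertToUtf32 calls. A minimal sketch, assuming .NET Core 3.0 or later and a top-level program:

```csharp
using System;
using System.Linq;
using System.Text;

string value = "a\U0001F300c\U0001F3EF"; // a, Cyclone, c, Japanese Castle

// Collect the UTF-32 values directly.
int[] codePoints = value.EnumerateRunes().Select(r => r.Value).ToArray();
Console.WriteLine(string.Join(", ", codePoints)); // 97, 127744, 99, 127983

// Track the starting index of each character in the UTF-16 string.
int index = 0;
foreach (Rune rune in value.EnumerateRunes())
{
    Console.WriteLine($"index {index}: U+{rune.Value:X4}, {rune.Utf16SequenceLength} code unit(s)");
    index += rune.Utf16SequenceLength; // 1 for BMP characters, 2 for surrogate pairs
}
```

The starting indexes come out as 0, 1, 3 and 4, matching the calls in the question that succeeded. According to the documentation, EnumerateRunes substitutes Rune.ReplacementChar for ill-formed sequences instead of throwing.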