I have the following string:
友𠂇又
The corresponding UTF-16 representation (little-endian) is:
CB 53 40 D8 87 DC C8 53
\___/ \_________/ \___/
 友       𠂇       又
"友𠂇又".Length
returns 4, because the CLR stores the string as four 2-byte Char values (UTF-16 code units), and 𠂇 needs a surrogate pair of two of them.
How do I measure the length of my string? How do I split it into { "友", "𠂇", "又" }?
As documented:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
Getting length:
new System.Globalization.StringInfo("友𠂇又").LengthInTextElements
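As a rough illustration of the difference, the sketch below prints both counts side by side and checks that the extra Char comes from a surrogate pair. The class name LengthDemo and its helper methods are made up for this example; the string is written with \u escapes only so that the result does not depend on the source file encoding.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class LengthDemo
{
    // Number of UTF-16 code units, which is what string.Length reports.
    public static int CodeUnitCount(string s) => s.Length;

    // Number of text elements as seen by StringInfo.
    public static int TextElementCount(string s) =>
        new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        // "友𠂇又": U+53CB, U+20087 (as the pair D840 DC87), U+53C8.
        string s = "\u53CB\uD840\uDC87\u53C8";

        Console.WriteLine(CodeUnitCount(s));           // 4
        Console.WriteLine(TextElementCount(s));        // 3
        // The Chars at index 1 and 2 together encode U+20087.
        Console.WriteLine(char.IsSurrogatePair(s, 1)); // True
    }
}
```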
Getting each Unicode character is documented here, but it's much more convenient to make an extension method:
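// Extension methods must live in a static, non-nested class; the name is arbitrary.
// Requires: using System.Collections.Generic;
public static class StringExtensions
{
    public static IEnumerable<string> TextElements(this string s)
    {
        var en = System.Globalization.StringInfo.GetTextElementEnumerator(s);
        while (en.MoveNext())
        {
            yield return en.GetTextElement();
        }
    }
}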
and use it in a foreach or in a LINQ statement:
foreach (string segment in "友𠂇又".TextElements())
{
    Console.WriteLine(segment);
}
The same method can also be used to get the length (Count() requires using System.Linq):
Console.WriteLine("友𠂇又".TextElements().Count());
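To get the array the question asks for, one possible sketch is a small non-lazy helper that collects the text elements into a string[]. SplitDemo and SplitTextElements are hypothetical names chosen for this example; only StringInfo.GetTextElementEnumerator and TextElementEnumerator come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class SplitDemo
{
    // Collects every text element of s into an array.
    public static string[] SplitTextElements(string s)
    {
        var parts = new List<string>();
        TextElementEnumerator en = StringInfo.GetTextElementEnumerator(s);
        while (en.MoveNext())
        {
            parts.Add(en.GetTextElement());
        }
        return parts.ToArray();
    }

    public static void Main()
    {
        // "友𠂇又", written with escapes to avoid source-encoding issues.
        string s = "\u53CB\uD840\uDC87\u53C8";
        string[] parts = SplitTextElements(s);

        Console.WriteLine(parts.Length);              // 3
        // The middle element is still two Chars (one surrogate pair).
        Console.WriteLine(parts[1].Length);           // 2
        // Joining the pieces gives back the original string.
        Console.WriteLine(string.Concat(parts) == s); // True
    }
}
```

Each array element is itself a string, so the element for 𠂇 still has a Length of 2 even though it is a single text element.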