I'm trying to take a substring of a string that contains multi-byte characters, and I'm not getting the results I expect. The strings look like 😂test. The first character takes 4 bytes in UTF-16, so calling ToCharArray on this string returns six char values, the first two of which are the surrogate pair that encodes the emoji.
So when I call .Substring(1) on this string, it returns an invalid string that starts with the third and fourth bytes of the first character, not 'test'. Is there any way to get .Substring and other string operations to treat that character as a single unit?
The term “multibyte character” is defined by ISO C to denote a byte sequence that encodes a single character of the extended character set (an ideogram, for example), no matter what encoding scheme is employed. A regular single-byte character is just a special case of a multibyte character.
A multibyte character is a character composed of a sequence of one or more bytes. Each byte sequence represents a single character in the extended character set. Multibyte characters are used in character sets such as Kanji. In Microsoft's C runtime, wide characters are multilingual character codes that are always 16 bits wide.
A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character). Each character stored in the string may occupy more than one byte.
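The Substring() method in C# retrieves a substring from a string instance. With a single argument, the substring starts at the specified character position and continues to the end of the string. Positions are counted in char values (UTF-16 code units), so a start index can fall in the middle of a surrogate pair.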
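To make the behaviour described in the question concrete, here is a minimal sketch of what the char-based view of the string looks like. The variable names are illustrative only, and the manual skip at the end is a stopgap for this one string, not a general solution.

```csharp
using System;

string s = "😂test";

// Length counts UTF-16 code units (char values), not user-perceived characters:
// two surrogates for the emoji plus 't', 'e', 's', 't'.
int length = s.Length;

// The emoji U+1F602 is stored as a high/low surrogate pair at indexes 0 and 1.
bool startsWithPair = char.IsSurrogatePair(s, 0);

// Substring(1) cuts between the two surrogates, leaving a lone low surrogate in front.
string broken = s.Substring(1);
bool startsWithLowSurrogate = char.IsLowSurrogate(broken[0]);

// Skipping both code units of the pair yields the intended remainder.
string rest = s.Substring(startsWithPair ? 2 : 1);

Console.WriteLine($"{length} {startsWithPair} {startsWithLowSurrogate} {rest}");
```

Checking for surrogate pairs by hand like this only covers code points outside the Basic Multilingual Plane. It does not handle combining sequences, which is one reason to prefer a text-element API.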
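You want to use StringInfo (from the System.Globalization namespace), which indexes a string by text elements instead of by individual char values:
using System.Globalization;
var yourstring = "😂test";
StringInfo si = new StringInfo(yourstring);
// Skip the first text element (the whole emoji), not just the first char.
var substring = si.SubstringByTextElements(1); // "test"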
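As a slightly fuller sketch of the same approach, the snippet below compares the char count with the text-element count and uses a few other StringInfo members. The variable names are made up for illustration. Exact text-element boundaries for complex sequences (such as emoji joined with zero-width joiners) can differ between .NET versions, so treat this as an illustration for the simple surrogate-pair case in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

string s = "😂test";
var si = new StringInfo(s);

int charCount = s.Length;                    // UTF-16 code units
int elementCount = si.LengthInTextElements;  // text elements

// Everything after the first text element.
string tail = si.SubstringByTextElements(1);

// The first text element alone; both surrogates stay together.
string first = si.SubstringByTextElements(0, 1);

// Walk the string one text element at a time.
var elements = new List<string>();
TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
while (enumerator.MoveNext())
{
    elements.Add(enumerator.GetTextElement());
}

Console.WriteLine($"{charCount} chars, {elementCount} text elements");
Console.WriteLine(string.Join("|", elements));
```

The two-argument overload SubstringByTextElements(start, length) takes both values in text elements, so it can replace Substring(start, length) wherever the input may contain characters outside the Basic Multilingual Plane.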