I want to get a substring of a given length, say 150. However, I want to make sure I don't cut the string off in the middle of a Unicode character.
For example, consider the following code:
var str = "Hello😀 world!";
var substr = str.Substring(0, 6);
Here substr is an invalid string, since the smiley character is cut in half.
Instead, I want a function that behaves as follows:
var str = "Hello😀 world!";
var substr = str.UnicodeSafeSubstring(0, 6);
where substr contains "Hello😀".
For reference, here is how I would do it in Objective-C using rangeOfComposedCharacterSequencesForRange:
NSString* str = @"Hello😀 world!";
NSRange range = [str rangeOfComposedCharacterSequencesForRange:NSMakeRange(0, 6)];
NSString* substr = [str substringWithRange:range];
What is the equivalent code in C#?
.NET uses UTF-16 to encode the text in a string. A char instance represents a 16-bit code unit. A single 16-bit code unit can represent any code point in the Basic Multilingual Plane (U+0000 to U+FFFF, excluding the surrogate range). A code point in the supplementary range needs two char instances, known as a surrogate pair.
UTF-16 allows all of the basic multilingual plane (BMP) to be represented as single code units. Unicode code points beyond U+FFFF are represented by surrogate pairs. The interesting thing is that Java and Windows (and other systems that use UTF-16) all operate at the code unit level, not the Unicode code point level.
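To make the code-unit behaviour concrete, here is a minimal sketch that checks whether cutting a string at a given UTF-16 index would separate a surrogate pair. The class name SurrogateDemo and the method SplitsSurrogatePair are hypothetical names introduced for illustration; char.IsHighSurrogate and char.IsLowSurrogate are the standard .NET methods.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: returns true when a cut at UTF-16 index `index`
    // would fall between a high surrogate and the low surrogate after it.
    public static bool SplitsSurrogatePair(string text, int index)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        return index > 0
            && index < text.Length
            && char.IsHighSurrogate(text[index - 1])
            && char.IsLowSurrogate(text[index]);
    }
}
```

For "Hello😀 world!", Length is 14 rather than 13, because the emoji occupies two char values (indices 5 and 6). SplitsSurrogatePair(str, 6) is true, which is why str.Substring(0, 6) ends in a lone high surrogate, while SplitsSurrogatePair(str, 7) is false.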
The equivalent in C# is the String class. According to MSDN, a String "represents text as a series of Unicode characters". So, if you write string str = "a string here";, you have a Unicode string (stored as UTF-16 code units).
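It looks like you want to split the string on graphemes, that is, on single displayed characters.
In that case, there is a handy method, StringInfo.SubstringByTextElements. Note that its arguments count text elements, not char values, so (0, 6) here means the first six displayed characters: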
var str = "Hello😀 world!";
var substr = new StringInfo(str).SubstringByTextElements(0, 6);
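If the limit has to be expressed in UTF-16 code units, as with the NSRange in the Objective-C version, one possible approach is to widen the requested range to the nearest text-element boundaries. The following is a rough sketch, not a definitive implementation: UnicodeSafeSubstring is the hypothetical method name from the question, and the widening behaviour is an assumption modelled on rangeOfComposedCharacterSequencesForRange. It relies on StringInfo.ParseCombiningCharacters, which returns the starting index of each text element. What counts as a text element can differ between .NET versions, so results for complex emoji sequences may vary.

```csharp
using System;
using System.Globalization;

public static class StringExtensions
{
    // Hypothetical extension method: takes a range in UTF-16 code units and
    // widens it so that it never starts or ends inside a text element.
    public static string UnicodeSafeSubstring(this string text, int startIndex, int length)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (startIndex < 0) throw new ArgumentOutOfRangeException(nameof(startIndex));
        if (length < 0 || startIndex > text.Length - length)
            throw new ArgumentOutOfRangeException(nameof(length));
        if (length == 0) return string.Empty;

        int end = startIndex + length;

        // Starting index (in code units) of every text element in the string.
        int[] starts = StringInfo.ParseCombiningCharacters(text);

        int safeStart = 0;
        int safeEnd = text.Length;
        foreach (int s in starts)
        {
            // Last element boundary at or before the requested start.
            if (s <= startIndex) safeStart = s;

            // First element boundary at or after the requested end.
            if (s >= end) { safeEnd = s; break; }
        }

        return text.Substring(safeStart, safeEnd - safeStart);
    }
}
```

With this sketch, "Hello😀 world!".UnicodeSafeSubstring(0, 6) returns "Hello😀" (seven code units), while UnicodeSafeSubstring(0, 5) returns "Hello". Because the range is widened, the result can be slightly longer than the requested length; if 150 is a hard upper limit, the end would need to be rounded down to the previous boundary instead.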