How can I split a Unicode string into multiple Unicode characters in C#?

Tags: c#, unicode

If I have a string like "😀123👨‍👩‍👧‍👦", how can I split it into an array that looks like ["😀", "1", "2", "3", "👨‍👩‍👧‍👦"]? If I use ToCharArray(), the first emoji is split into 2 chars and the second into 11 chars (7 code points: four emoji joined by three zero-width joiners, with each emoji taking 2 chars).

Update

The solution now looks like this:

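// Requires: using System.Collections.Generic;
public static List<string> GetCharacters(string text)
{
    char[] ca = text.ToCharArray();
    List<string> characters = new List<string>();
    for (int i = 0; i < ca.Length; i++)
    {
        char c = ca[i];
        // A high surrogate followed by a low surrogate encodes one code point,
        // so keep the two chars together. The bounds check avoids reading past
        // the end when the string ends with an unpaired high surrogate.
        if (char.IsHighSurrogate(c) && i + 1 < ca.Length && char.IsLowSurrogate(ca[i + 1]))
        {
            i++;
            characters.Add(new string(new[] { c, ca[i] }));
        }
        else
            characters.Add(new string(new[] { c }));
    }
    return characters;
}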

Please note that, as mentioned in the comments, it doesn't work for the family emoji. It only works for emoji that consist of a single code point (at most 2 chars). The output for the example would be: ["😀", "1", "2", "3", "👨", "\u200D", "👩", "\u200D", "👧", "\u200D", "👦"], where each zero-width joiner (U+200D) ends up as its own invisible element.

asked Feb 14 '17 13:02 by mjw



1 Answer

.NET represents strings as a sequence of UTF-16 code units. A Unicode code point outside the Basic Multilingual Plane (BMP) is encoded as a surrogate pair: a high surrogate followed by a low surrogate. The lower 10 bits of each surrogate supply half of a 20-bit value, and adding 0x10000 to that value gives the real code point.
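
To make the bit layout concrete, here is a minimal sketch of how a surrogate pair maps back to a code point. The class and method names (SurrogateMath, Combine) are made up for illustration; in real code the built-in char.ConvertToUtf32(char, char) does the same job.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: combine a UTF-16 surrogate pair into the
    // code point it encodes.
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Not a valid surrogate pair.");

        int highBits = high - 0xD800; // lower 10 bits of the high surrogate
        int lowBits = low - 0xDC00;   // lower 10 bits of the low surrogate

        // The two halves form a 20-bit offset above the BMP.
        return 0x10000 + (highBits << 10) + lowBits;
    }
}
```

For example, 😀 is stored as the two chars D83D and DE00, and SurrogateMath.Combine('\uD83D', '\uDE00') gives 0x1F600, the same value char.ConvertToUtf32 returns for that pair.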

There are helpers to detect these surrogates (e.g. Char.IsHighSurrogate and Char.IsLowSurrogate).

ToCharArray() does not recombine the pairs for you, so you need to handle this yourself.
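
As an alternative to hand-rolled surrogate handling, and not something the answer above proposes, the framework's System.Globalization.StringInfo class can enumerate "text elements" instead of chars. The sketch below is a minimal wrapper; the names TextElements and Split are made up for illustration. One assumption to verify on your target runtime: to my knowledge, .NET 5 and later keep ZWJ sequences such as the family emoji together as one text element, while .NET Framework and .NET Core 3.1 keep surrogate pairs together but still break the sequence apart at the joiners.

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class TextElements
{
    // Hypothetical helper: split a string into text elements as the
    // runtime defines them, rather than into UTF-16 chars.
    public static List<string> Split(string text)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(text);
        while (e.MoveNext())
        {
            // GetTextElement returns the current element as a string,
            // so a surrogate pair stays in one piece.
            result.Add(e.GetTextElement());
        }
        return result;
    }
}
```

Calling TextElements.Split("😀123") returns four strings, with the emoji in one piece. How many elements the family emoji produces depends on the runtime, as noted above.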

answered Oct 02 '22 03:10 by Richard