
Substring UTF-8 for both Latin, Chinese, Cyrillic, etc

Tags: c#, .net, utf-8

On Windows Phone, I want to truncate any given string to the equivalent of 100 ASCII characters in length.

String.Length is obviously useless, because in UTF-8 a Chinese character typically takes 3 bytes, a Danish letter takes 1 or 2 bytes, and a Russian letter takes 2 bytes.

The only available encodings are UTF-8 and UTF-16. So what do I do?

The idea is this:

private static string UnicodeSubstring(string text, int length)
{
    var bytes = Encoding.UTF8.GetBytes(text);

    return Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, length));
}

But the cut has to fall on a character boundary rather than in the middle of a multi-byte sequence, so that the last character is always rendered correctly.

Claus Jørgensen asked Sep 13 '12 16:09



1 Answer

One option is to simply go through the string, computing the number of bytes for each character.

If you know you don't need to deal with characters outside the Basic Multilingual Plane (BMP), this is reasonably simple:

public string SubstringWithinUtf8Limit(string text, int byteLimit)
{
    int byteCount = 0;
    char[] buffer = new char[1];
    for (int i = 0; i < text.Length; i++)
    {
        buffer[0] = text[i];
        byteCount += Encoding.UTF8.GetByteCount(buffer);
        if (byteCount > byteLimit)
        {
            // This character would exceed the limit, so return everything before it.
            return text.Substring(0, i);
        }
    }
    return text;
}

It becomes slightly trickier if you have to handle surrogate pairs :(
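
For what it's worth, here is a rough sketch of how the surrogate-pair case might be handled. It is not part of the original answer: the class name Utf8Truncation is made up for illustration, and it assumes char.IsHighSurrogate, char.IsLowSurrogate and the Encoding.GetByteCount(char[], int, int) overload are available on the target platform. The idea is to treat a well-formed surrogate pair as a single unit, so it is either kept whole or dropped whole.

```csharp
using System;
using System.Text;

// Hypothetical helper class, named here only for illustration.
public static class Utf8Truncation
{
    public static string SubstringWithinUtf8Limit(string text, int byteLimit)
    {
        char[] chars = text.ToCharArray();
        int byteCount = 0;
        int i = 0;
        while (i < chars.Length)
        {
            // A well-formed surrogate pair spans two chars and encodes one code point,
            // so measure both chars together and never cut between them.
            bool isPair = char.IsHighSurrogate(chars[i])
                          && i + 1 < chars.Length
                          && char.IsLowSurrogate(chars[i + 1]);
            int unitLength = isPair ? 2 : 1;

            byteCount += Encoding.UTF8.GetByteCount(chars, i, unitLength);
            if (byteCount > byteLimit)
            {
                // This unit would exceed the limit, so return everything before it.
                return text.Substring(0, i);
            }
            i += unitLength;
        }
        return text;
    }
}
```

For example, SubstringWithinUtf8Limit("a\U0001F600b", 4) should give "a", because the emoji needs 4 bytes on top of the 1 byte for "a", while a limit of 5 should give "a\U0001F600". Note that this only keeps code points intact. A base character followed by combining marks can still be separated from those marks.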

Jon Skeet answered Sep 30 '22 12:09
