Substring UTF-8 for Latin, Chinese, Cyrillic, etc.

Question

On Windows Phone, I want to truncate any given string to the equivalent of 100 ASCII characters in length, i.e. 100 bytes.

String.Length is obviously useless here, as it counts UTF-16 code units rather than bytes: in UTF-8, a Chinese string typically uses 3 bytes per character, a Danish string uses 1 or 2 bytes per character, and a Russian string uses 2 bytes per character.

The only available encodings are UTF-8 and UTF-16. So what do I do?

The idea is this:

private static string UnicodeSubstring(string text, int length)
{
    var bytes = Encoding.UTF8.GetBytes(text);

    return Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, length));
}

But the cut has to fall on a character boundary rather than in the middle of a multi-byte sequence, so that the last character is always rendered correctly.
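To illustrate the problem, here is a minimal sketch of what the byte-level cut does when the limit lands inside a character. The class name `NaiveCutDemo` and the sample string are assumptions made for this example; `NaiveCut` simply repeats the idea from the snippet above.

```csharp
using System;
using System.Text;

public static class NaiveCutDemo
{
    // Same approach as the snippet above: encode, cut the byte array at a
    // fixed offset, decode whatever is left.
    public static string NaiveCut(string text, int length)
    {
        var bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, length));
    }

    // Example (illustrative only):
    //   "\u65E5\u672C" is two CJK characters, 3 UTF-8 bytes each, 6 bytes total.
    //   NaiveCut("\u65E5\u672C", 4) keeps the first character plus one dangling
    //   byte of the second, which cannot be decoded back to a real character.
}
```

With a limit of 3 or 6 the cut happens to land on a boundary and the result is clean; with 4 or 5 the decoder has to do something with the leftover byte or bytes, and on current .NET runtimes that usually means a U+FFFD replacement character at the end of the result.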

Jon Skeet · Accepted Answer

One option is to simply go through the string, computing the number of bytes for each character.

If you know you don't need to deal with characters outside the Basic Multilingual Plane (BMP), this is reasonably simple:

public string SubstringWithinUtf8Limit(string text, int byteLimit)
{
    int byteCount = 0;
    char[] buffer = new char[1];
    for (int i = 0; i < text.Length; i++)
    {
        buffer[0] = text[i];
        byteCount += Encoding.UTF8.GetByteCount(buffer);
        if (byteCount > byteLimit)
        {
            // Couldn't add this character. Return everything before it.
            return text.Substring(0, i);
        }
    }
    return text;
}

It becomes slightly trickier if you have to handle surrogate pairs :(
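One way to handle surrogate pairs is to treat a high surrogate followed by a low surrogate as a single unit, so that the pair is either kept whole or dropped whole. The following is only a sketch along those lines, not part of the original answer; the class name `Utf8Truncation` and the method name `SubstringWithinUtf8LimitSurrogateAware` are made up for illustration.

```csharp
using System;
using System.Text;

public static class Utf8Truncation
{
    // Hypothetical variant of the method above that never splits a surrogate pair.
    public static string SubstringWithinUtf8LimitSurrogateAware(string text, int byteLimit)
    {
        int byteCount = 0;
        char[] buffer = new char[2];
        int i = 0;
        while (i < text.Length)
        {
            // A code point outside the BMP is stored as two chars:
            // a high surrogate immediately followed by a low surrogate.
            bool isPair = char.IsHighSurrogate(text[i])
                          && i + 1 < text.Length
                          && char.IsLowSurrogate(text[i + 1]);
            int charCount = isPair ? 2 : 1;

            buffer[0] = text[i];
            if (isPair)
            {
                buffer[1] = text[i + 1];
            }

            // Count the bytes for the whole unit (1 to 4 bytes in UTF-8).
            byteCount += Encoding.UTF8.GetByteCount(buffer, 0, charCount);
            if (byteCount > byteLimit)
            {
                // Couldn't add this unit. Return everything before it.
                return text.Substring(0, i);
            }
            i += charCount;
        }
        return text;
    }
}
```

For example, `"a\uD83D\uDE00b"` (an "a", one emoji encoded as a surrogate pair, then a "b") needs 1 + 4 + 1 bytes, so a limit of 4 yields `"a"` and a limit of 5 yields the "a" plus the complete emoji. Note that this still only respects code point boundaries: a base letter followed by combining marks can be cut between the two, so text where that matters would need a grapheme-aware approach instead.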

Tags:

c#

.net

utf-8

Claus Jørgensen
