.NET String object and invalid Unicode code points

Question

Is it possible that a .NET String object will contain an invalid Unicode code point?

If yes, how could this happen (and how can I determine whether a string contains such invalid chars)?

Andrei Bozantan · Accepted Answer

Although the answer given by @DPenner1 is excellent (and I used it as a starting point), I want to add some details. Besides orphaned surrogates, which I think are a clear sign of an invalid string, a string can always contain unassigned code points. The .NET Framework cannot treat that case as an error, because new characters are continually added to the Unicode standard; see, for example, the Unicode version history at http://en.wikipedia.org/wiki/Unicode#Versions. To make this concrete, the call Char.GetUnicodeCategory(Char.ConvertFromUtf32(0x1F01C), 0); returns UnicodeCategory.OtherNotAssigned on .NET 2.0 but UnicodeCategory.OtherSymbol on .NET 4.0.
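To illustrate the point about unassigned code points, here is a minimal sketch that scans a string for code points the current runtime reports as OtherNotAssigned. The class and method names (UnassignedScan, FindFirstUnassigned) are hypothetical and not part of the framework. The result depends on the Unicode tables shipped with the runtime, so the same string may give different answers on different .NET versions.

```csharp
using System;
using System.Globalization;

static class UnassignedScan
{
    // Hypothetical helper: returns the index of the first code point that the
    // current runtime classifies as OtherNotAssigned, or -1 if there is none.
    public static int FindFirstUnassigned(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // For a valid surrogate pair starting at i, this overload reports
            // the category of the whole supplementary code point.
            if (CharUnicodeInfo.GetUnicodeCategory(s, i) == UnicodeCategory.OtherNotAssigned)
            {
                return i;
            }
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low half of the pair
            }
        }
        return -1;
    }

    static void Main()
    {
        string tile = char.ConvertFromUtf32(0x1F01C);
        Console.WriteLine(FindFirstUnassigned("abc"));
        // Result for the next line depends on the runtime's Unicode version.
        Console.WriteLine(FindFirstUnassigned("ab" + tile));
    }
}
```

A hit from this check is not an error by itself; it only means the runtime's Unicode data does not (yet) know the code point.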

Besides this, there is another interesting point: the .NET class library methods do not even agree with each other on how to handle Unicode noncharacters and unpaired surrogates. For example:

Unpaired surrogate char:
- System.Text.Encoding.Unicode.GetBytes("\uDDDD"); returns { 0xfd, 0xff }, the encoding of the replacement character (U+FFFD); that is, the data is considered invalid.
- "\uDDDD".Normalize(); throws an exception with the message "Invalid Unicode code point found at index 0."; that is, the data is considered invalid.

Noncharacter code points:
- System.Text.Encoding.Unicode.GetBytes("\uFFFF"); returns { 0xff, 0xff }; that is, the data is considered valid.
- "\uFFFF".Normalize(); throws an exception with the message "Invalid Unicode code point found at index 0."; that is, the data is considered invalid.
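The comparison above can be reproduced with a small sketch like the following. The class and helper names (InconsistencyDemo, NormalizeRejects, Utf16Bytes) are made up for illustration. The behaviour of String.Normalize() on noncharacters is an assumption taken from the observations above and may differ between runtimes and platforms, so the program prints what it observes rather than asserting a fixed result.

```csharp
using System;
using System.Text;

static class InconsistencyDemo
{
    // Returns true when String.Normalize() rejects the input with an ArgumentException.
    public static bool NormalizeRejects(string s)
    {
        try
        {
            s.Normalize();
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }

    // Hex dump of the UTF-16LE bytes produced by Encoding.Unicode.
    public static string Utf16Bytes(string s)
    {
        return BitConverter.ToString(Encoding.Unicode.GetBytes(s));
    }

    static void Main()
    {
        foreach (string s in new[] { "\uDDDD", "\uFFFF" })
        {
            Console.WriteLine("U+{0:X4}: bytes={1}, Normalize rejects={2}",
                (int)s[0], Utf16Bytes(s), NormalizeRejects(s));
        }
    }
}
```

Encoding.Unicode uses a replacement fallback by default, which is why the unpaired surrogate is silently replaced instead of causing an exception.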

Below is a method which will search for invalid chars in a string:

/// <summary>
/// Searches for invalid characters (noncharacters defined in the Unicode standard and invalid surrogate pairs) in a string
/// </summary>
/// <param name="aString"> the string to search for invalid chars </param>
/// <returns>the index of the first bad char or -1 if no bad char is found</returns>
static int FindInvalidCharIndex(string aString)
{
    int ch;
    int chlow;

    for (int i = 0; i < aString.Length; i++)
    {
        ch = aString[i];
        if (ch < 0xD800) // char is up to first high surrogate
        {
            continue;
        }
        if (ch >= 0xD800 && ch <= 0xDBFF)
        {
            // found high surrogate -> check surrogate pair
            i++;
            if (i == aString.Length)
            {
                // last char is high surrogate, so it is missing its pair
                return i - 1;
            }

            chlow = aString[i];
            if (!(chlow >= 0xDC00 && chlow <= 0xDFFF))
            {
                // did not find a low surrogate after the high surrogate
                return i - 1;
            }

            // convert to UTF32 - like in Char.ConvertToUtf32(highSurrogate, lowSurrogate)
            ch = (ch - 0xD800) * 0x400 + (chlow - 0xDC00) + 0x10000;
            if (ch > 0x10FFFF)
            {
                // invalid Unicode code point - maximum exceeded (defensive check; a valid pair never decodes above 0x10FFFF)
                return i;
            }
            if ((ch & 0xFFFE) == 0xFFFE)
            {
                // other non-char found
                return i;
            }
            // found a good surrogate pair
            continue;
        }

        if (ch >= 0xDC00 && ch <= 0xDFFF)
        {
            // unexpected low surrogate
            return i;
        }

        if (ch >= 0xFDD0 && ch <= 0xFDEF)
        {
            // noncharacters U+FDD0..U+FDEF, considered invalid by String.Normalize()
            return i;
        }

        if ((ch & 0xFFFE) == 0xFFFE)
        {
            // other non-char found
            return i;
        }
    }

    return -1;
}
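If only unpaired surrogates matter (and not noncharacters), the same scan can be written with the framework's own char.IsHighSurrogate and char.IsLowSurrogate helpers. This is a minimal alternative sketch, not a replacement for the method above; the class and method names (SurrogateCheck, FindUnpairedSurrogate) are hypothetical.

```csharp
using System;

static class SurrogateCheck
{
    // Hypothetical helper: returns the index of the first unpaired surrogate,
    // or -1 if every surrogate in the string is part of a well-formed pair.
    public static int FindUnpairedSurrogate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]))
            {
                if (i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                {
                    i++; // skip the low half of a well-formed pair
                    continue;
                }
                return i; // high surrogate without a following low surrogate
            }
            if (char.IsLowSurrogate(s[i]))
            {
                return i; // low surrogate without a preceding high surrogate
            }
        }
        return -1;
    }

    static void Main()
    {
        string pair = char.ConvertFromUtf32(0x1F01C); // a well-formed surrogate pair
        Console.WriteLine(FindUnpairedSurrogate("a" + pair));                 // -1
        Console.WriteLine(FindUnpairedSurrogate("a" + pair.Substring(0, 1))); // 1
        Console.WriteLine(FindUnpairedSurrogate("\uDDDD"));                   // 0
    }
}
```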

DPenner1 · Answer

Yes, it is possible. According to Microsoft's documentation, a .NET String is simply:

"A String object is a sequential collection of System.Char objects that represent a string."

while a .NET Char:

"Represents a character as a UTF-16 code unit."

Taken together, this means that a .NET String is just a sequence of UTF-16 code units, whether or not they form a valid string according to the Unicode standard. There are many ways this can occur; some of the more common ones I can think of are:

- A non-UTF-16 byte stream was mistakenly put into a String object without proper conversion.
- A String object was split in the middle of a surrogate pair.
- Someone purposely included such a String to test the system's robustness.
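The second case, splitting in the middle of a surrogate pair, is easy to trigger by accident when truncating by char count. The sketch below shows a naive truncation next to one that backs off by one char when the cut would strand a high surrogate. The names (SplitDemo, NaiveTruncate, SafeTruncate) are hypothetical, and the code assumes a non-negative length.

```csharp
using System;

static class SplitDemo
{
    // Cuts after 'length' chars without looking at surrogate boundaries.
    public static string NaiveTruncate(string s, int length)
    {
        return s.Substring(0, Math.Min(length, s.Length));
    }

    // Same cut, but moves it back by one char when it would fall between
    // the high and low halves of a surrogate pair.
    public static string SafeTruncate(string s, int length)
    {
        if (length >= s.Length)
        {
            return s;
        }
        if (length > 0 && char.IsHighSurrogate(s[length - 1]) && char.IsLowSurrogate(s[length]))
        {
            length--;
        }
        return s.Substring(0, length);
    }

    static void Main()
    {
        string s = "a" + char.ConvertFromUtf32(0x1F01C); // 3 chars: 'a' plus one surrogate pair
        string naive = NaiveTruncate(s, 2);
        string safe = SafeTruncate(s, 2);
        Console.WriteLine(char.IsHighSurrogate(naive[naive.Length - 1])); // True: the pair was split
        Console.WriteLine(safe.Length);                                   // 1
    }
}
```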

As a result, the following C# code is completely legal and will compile:

class Test
{
    static void Main()
    {
        string s =
            "\uEEEE" + // A private use character
            "\uDDDD" + // An unpaired surrogate character
            "\uFFFF" + // A Unicode noncharacter
            "\u0888";  // A character that was unassigned at the time of writing
        System.Console.WriteLine(s); // Output is highly console dependent
    }
}
