Is it possible that a .NET String object will contain an invalid Unicode code point?
If yes, how could this happen (and how can I determine whether the string contains such invalid chars)?
Although the answer given by @DPenner is excellent (and I used it as a starting point), I want to add some details.
Besides orphaned surrogates, which I think are a clear sign of an invalid string, there is always the possibility that a string contains unassigned code points. The .NET Framework can't treat this case as an error, since new characters are continually added to the Unicode standard; see, for example, the list of Unicode versions at http://en.wikipedia.org/wiki/Unicode#Versions. To make this clearer, the call
Char.GetUnicodeCategory(Char.ConvertFromUtf32(0x1F01C), 0);
returns UnicodeCategory.OtherNotAssigned when using .NET 2.0, but UnicodeCategory.OtherSymbol when using .NET 4.0.
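To see what a given runtime reports, one could scan a string code point by code point and flag anything in the OtherNotAssigned category. The following is only a sketch: the names UnassignedScan and FindFirstUnassigned are made up for this example, and the result depends on the Unicode tables shipped with the runtime, so the same string can give different answers on different .NET versions.

```csharp
using System;
using System.Globalization;

static class UnassignedScan
{
    // Hypothetical helper: returns the index of the first code point that the
    // current runtime classifies as unassigned, or -1 if there is none.
    public static int FindFirstUnassigned(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            // The (string, int) overload reports the category of the whole
            // surrogate pair when a well-formed pair starts at index i.
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, i);
            if (category == UnicodeCategory.OtherNotAssigned)
            {
                return i;
            }
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low surrogate of a well-formed pair
            }
        }
        return -1;
    }
}
```

Note that an unpaired surrogate is reported as UnicodeCategory.Surrogate, not as unassigned, so this scan does not replace a check for ill-formed surrogate pairs.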
Besides this, there is another interesting point: not even the .NET class library methods agree on how to handle Unicode noncharacters and unpaired surrogate characters. For example:
System.Text.Encoding.Unicode.GetBytes("\uDDDD");
- returns { 0xfd, 0xff }, the encoding of the replacement character (U+FFFD); that is, the data is considered invalid.
"\uDDDD".Normalize();
- throws an exception with the message "Invalid Unicode code point found at index 0."; that is, the data is considered invalid.
System.Text.Encoding.Unicode.GetBytes("\uFFFF");
- returns { 0xff, 0xff }; that is, the data is considered valid.
"\uFFFF".Normalize();
- throws an exception with the message "Invalid Unicode code point found at index 0."; that is, the data is considered invalid.
Below is a method which searches for invalid chars in a string:
/// <summary>
/// Searches for invalid characters (noncharacters defined in the Unicode standard and invalid surrogate pairs) in a string
/// </summary>
/// <param name="aString">the string to search for invalid chars</param>
/// <returns>the index of the first bad char or -1 if no bad char is found</returns>
static int FindInvalidCharIndex(string aString)
{
    int ch;
    int chlow;
    for (int i = 0; i < aString.Length; i++)
    {
        ch = aString[i];
        if (ch < 0xD800) // char is below the first high surrogate
        {
            continue;
        }
        if (ch >= 0xD800 && ch <= 0xDBFF)
        {
            // found high surrogate -> check surrogate pair
            i++;
            if (i == aString.Length)
            {
                // last char is a high surrogate, so it is missing its pair
                return i - 1;
            }
            chlow = aString[i];
            if (!(chlow >= 0xDC00 && chlow <= 0xDFFF))
            {
                // did not find a low surrogate after the high surrogate
                return i - 1;
            }
            // convert to UTF32 - like in Char.ConvertToUtf32(highSurrogate, lowSurrogate)
            ch = (ch - 0xD800) * 0x400 + (chlow - 0xDC00) + 0x10000;
            if (ch > 0x10FFFF)
            {
                // invalid Unicode code point - maximum exceeded
                // (cannot happen for a well-formed pair; kept as a safeguard)
                return i;
            }
            if ((ch & 0xFFFE) == 0xFFFE)
            {
                // other non-char found
                return i;
            }
            // found a good surrogate pair
            continue;
        }
        if (ch >= 0xDC00 && ch <= 0xDFFF)
        {
            // unexpected low surrogate
            return i;
        }
        if (ch >= 0xFDD0 && ch <= 0xFDEF)
        {
            // non-chars are considered invalid by System.Text.Encoding.GetBytes() and String.Normalize()
            return i;
        }
        if ((ch & 0xFFFE) == 0xFFFE)
        {
            // other non-char found
            return i;
        }
    }
    return -1;
}
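If only the surrogate part of the check is needed, a possible shortcut is to let a strict encoder do the work: a UnicodeEncoding constructed with throwOnInvalidBytes set to true throws instead of substituting the replacement character. The sketch below assumes that behavior; the names StrictUtf16Check and HasOnlyPairedSurrogates are invented for this example. It does not flag noncharacters such as U+FFFF, which the encoder accepts as shown above.

```csharp
using System;
using System.Text;

static class StrictUtf16Check
{
    // throwOnInvalidBytes: true makes the encoder throw on ill-formed input
    // instead of silently emitting U+FFFD.
    private static readonly UnicodeEncoding Strict =
        new UnicodeEncoding(bigEndian: false, byteOrderMark: false, throwOnInvalidBytes: true);

    // Hypothetical helper: true if every surrogate in s is part of a well-formed pair.
    public static bool HasOnlyPairedSurrogates(string s)
    {
        try
        {
            // Counting bytes walks the whole string without allocating the output.
            Strict.GetByteCount(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

Unlike the hand-written loop, this gives only a yes/no answer; the index of the offending char can be read from the exception's Index property if it is needed.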
Yes, it is possible. According to Microsoft's documentation, a .NET String is simply
"A String object is a sequential collection of System.Char objects that represent a string."
while a .NET Char
"Represents a character as a UTF-16 code unit."
Taken together, this means that a .NET String is just a sequence of UTF-16 code units, whether or not they form a valid string according to the Unicode standard. There are many ways this can occur; some of the more common ones I can think of are:
As a result, the following C# code is completely legal and will compile:
class Test
{
    static void Main()
    {
        string s =
            "\uEEEE" + // A private use character
            "\uDDDD" + // An unpaired surrogate character
            "\uFFFF" + // A Unicode noncharacter
            "\u0888";  // A currently unassigned character
        System.Console.WriteLine(s); // Output is highly console dependent
    }
}
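Such strings do not have to come from escape sequences. As one illustration (my own example, with made-up names SplitPairDemo, NaiveTruncate and SafeTruncate), cutting a string at an arbitrary UTF-16 code unit can split a surrogate pair and leave an unpaired surrogate behind:

```csharp
using System;

static class SplitPairDemo
{
    // Hypothetical helper: keeps the first 'count' UTF-16 code units,
    // ignoring surrogate pair boundaries, just as Substring does.
    public static string NaiveTruncate(string s, int count)
    {
        return s.Substring(0, Math.Min(count, s.Length));
    }

    // Variant that backs up one code unit rather than splitting a pair.
    public static string SafeTruncate(string s, int count)
    {
        int n = Math.Min(count, s.Length);
        if (n > 0 && n < s.Length
            && char.IsHighSurrogate(s[n - 1]) && char.IsLowSurrogate(s[n]))
        {
            n--; // the cut would fall between the two halves of a pair
        }
        return s.Substring(0, n);
    }
}
```

For example, with string tile = "A" + char.ConvertFromUtf32(0x1F01C), which is three chars long, NaiveTruncate(tile, 2) ends in an unpaired high surrogate, while SafeTruncate(tile, 2) returns "A".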