I'm trying to generate a random string in .NET and convert it to bytes, but I'm running into a little difficulty. I'd like the full set of possible characters, and my understanding is that a string can contain any character.
My code is currently as follows:
var plainText = new StringBuilder();
for (int j = 0; j < stringLength; ++j)
{
    plainText.Append((char)_random.Next(char.MinValue, char.MaxValue));
}
byte[] x = Encoding.Unicode.GetBytes(plainText.ToString());
string result = Encoding.Unicode.GetString(x);
In theory, plainText and result should be identical. They're mostly the same, but some of the original characters are lost: characters in the 55000-57000 range seem to be replaced with character 65533.
I'm assuming the problem is with my encoding, but I thought Unicode would handle this properly. I've tried UTF8 and UTF32, but those give me the same problem.
Any thoughts?
The problem is that the characters in the range 0xD800-0xDFFF (55296-57343), called Unicode surrogate characters, are not valid on their own. They must appear as a pair (0xD800-0xDBFF first, 0xDC00-0xDFFF second) in order to be valid (in the UTF-16 encoding scheme). Alone, they will be treated as invalid characters and decoded to 0xFFFD (65533). C# uses UTF-16 to represent its strings, so that's why you are seeing that output.
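A quick way to see this for yourself (the class name here is just for illustration):

using System;
using System.Text;

class LoneSurrogateDemo
{
    static void Main()
    {
        string lone = "\uD800"; // a lone high surrogate, invalid by itself

        // The default encoder fallback substitutes U+FFFD for the
        // invalid sequence when encoding.
        byte[] bytes = Encoding.Unicode.GetBytes(lone);
        string decoded = Encoding.Unicode.GetString(bytes);

        Console.WriteLine((int)decoded[0]); // prints 65533 (0xFFFD)
    }
}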
You can either filter them out (e.g. calling _random.Next until you get a non-surrogate character), or generate legal surrogate pairs whenever you generate a surrogate character.
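A minimal sketch of the filtering approach, reusing your _random field (the method name is mine):

using System;
using System.Text;

class RandomStringSketch
{
    static readonly Random _random = new Random();

    // Rejection sampling: re-roll whenever the random char lands in the
    // surrogate range (0xD800-0xDFFF), so every char is valid on its own.
    static string NextNonSurrogateString(int stringLength)
    {
        var plainText = new StringBuilder();
        for (int j = 0; j < stringLength; ++j)
        {
            char c;
            do
            {
                c = (char)_random.Next(char.MinValue, char.MaxValue);
            } while (char.IsSurrogate(c));
            plainText.Append(c);
        }
        return plainText.ToString();
    }
}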
Those are surrogate characters, 55296-57343 (0xD800-0xDFFF). You need to pair them up correctly: a pair of surrogate characters in UTF-16 describes a single Unicode code point.
You seem to be operating on the assumption that a char and a code point are the same thing. That's not true: there are more than 2^16 code points (Unicode defines 0x110000 of them), so a single 16-bit char cannot represent them all.
I recommend reading the Wikipedia article on UTF-16.
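If you want to sample from the full set of code points rather than from chars, one sketch (the method name is mine) uses char.ConvertFromUtf32, which returns one char for values up to 0xFFFF and a correct surrogate pair for values above that:

using System;
using System.Text;

class CodePointSketch
{
    static readonly Random _random = new Random();

    // Samples whole code points instead of chars. Values in the
    // surrogate range are skipped, since char.ConvertFromUtf32
    // throws for them: they are not valid scalar values to encode.
    static string NextRandomCodePoints(int codePointCount)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < codePointCount; ++i)
        {
            int cp;
            do
            {
                cp = _random.Next(0, 0x110000); // U+0000 through U+10FFFF
            } while (cp >= 0xD800 && cp <= 0xDFFF);
            sb.Append(char.ConvertFromUtf32(cp));
        }
        return sb.ToString();
    }
}

Note that the resulting string may contain more chars than codePointCount, because each code point above 0xFFFF occupies two chars.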