
Generating a random string

I'm trying to generate a random string in .NET and convert it to bytes, but I'm running into some difficulty. I'd like to draw from the full set of possible characters, and my understanding is that a string can contain any character.

My code is currently as follows:

var plainText = new StringBuilder();
for (int j = 0; j < stringLength; ++j)
{
    plainText.Append((char)_random.Next(char.MinValue, char.MaxValue));
}
byte[] x = Encoding.Unicode.GetBytes(plainText.ToString());
string result = Encoding.Unicode.GetString(x);

In theory, plainText and result should be identical. They are mostly the same, but some of the original characters are lost: characters in roughly the 55000-57000 range are replaced with character 65533.

I'm assuming the problem is with my encoding, but I thought Unicode would handle this properly. I've tried UTF8 and UTF32, but those give me the same problem.

Any thoughts?

Joe Enos asked Aug 26 '12 05:08



2 Answers

The problem is that the values in the range 0xD800-0xDFFF (55296-57343), called Unicode surrogates, are not valid on their own. In UTF-16 they must appear as a pair: a high surrogate (0xD800-0xDBFF) first, followed by a low surrogate (0xDC00-0xDFFF). An unpaired surrogate is treated as invalid and decoded to 0xFFFD (65533), the Unicode replacement character. C# uses UTF-16 to represent its strings, which is why you are seeing that output.

You can either choose to filter them out (e.g. calling _random.Next until you get a non-surrogate character), or generate legal surrogate pairs whenever you generate a surrogate character.
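Both options can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name RandomText and the method names WithoutSurrogates and FromCodePoints are made up for this example, and the Random instance is passed in rather than read from the _random field in the question. The second method draws whole code points and lets char.ConvertFromUtf32 produce a valid surrogate pair when one is needed, which is one way of generating legal pairs.

```csharp
using System;
using System.Text;

// Hypothetical helper class; the names are illustrative only.
static class RandomText
{
    // Option 1: redraw until the value is not a surrogate code unit.
    public static string WithoutSurrogates(Random random, int length)
    {
        var sb = new StringBuilder(length);
        while (sb.Length < length)
        {
            // The upper bound of Next is exclusive, so add 1 to include char.MaxValue.
            char c = (char)random.Next(char.MinValue, char.MaxValue + 1);
            if (!char.IsSurrogate(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Option 2: draw whole code points; anything above U+FFFF becomes a valid pair.
    public static string FromCodePoints(Random random, int codePointCount)
    {
        var sb = new StringBuilder(codePointCount);
        int appended = 0;
        while (appended < codePointCount)
        {
            int codePoint = random.Next(0, 0x110000); // U+0000..U+10FFFF
            if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            {
                continue; // surrogate values are not code points on their own
            }
            // Returns one char, or a high/low surrogate pair for code points above U+FFFF.
            sb.Append(char.ConvertFromUtf32(codePoint));
            appended++;
        }
        return sb.ToString();
    }
}
```

Strings built by either method contain no unpaired surrogates, so encoding them with Encoding.Unicode.GetBytes and decoding with Encoding.Unicode.GetString should give back the original string. Note that FromCodePoints can return a string whose Length is greater than codePointCount, because each code point above U+FFFF occupies two chars.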

nneonneo answered Sep 17 '22 23:09



Those are surrogate characters, 55296-57343 (0xD800-0xDFFF). You need to pair them up correctly: a pair of surrogates in UTF-16 describes a single Unicode code point.

You seem to be operating on the assumption that a char and a code point are the same thing. That's not true: a char holds only 16 bits, and there are more than 2^16 code points.
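The difference between a char and a code point can be seen with a small sketch. The class and method names here are hypothetical and exist only for illustration; the calls to char.ConvertFromUtf32 and char.ConvertToUtf32 are standard .NET methods.

```csharp
using System;

// Hypothetical helper class; the names are illustrative only.
static class CharVersusCodePoint
{
    // Number of UTF-16 code units (C# chars) needed to store one code point.
    public static int CodeUnitCount(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint).Length;
    }

    // Decodes the code point that starts at index 0, combining a surrogate pair if present.
    public static int DecodeFirst(string s)
    {
        return char.ConvertToUtf32(s, 0);
    }
}
```

For example, CodeUnitCount(0x41) is 1, while CodeUnitCount(0x1F600) is 2: the code point U+1F600 is stored as a high surrogate followed by a low surrogate, and DecodeFirst turns that pair back into 0x1F600.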

I recommend reading the UTF-16 Wikipedia Article.

CodesInChaos answered Sep 20 '22 23:09
