I'm trying to generate a random string in .NET and convert it to bytes, but I'm running into a little difficulty. I'd like the full set of possible characters, and my understanding is that a string can contain any character.
My code is currently as follows:
var plainText = new StringBuilder();
for (int j = 0; j < stringLength; ++j)
{
    plainText.Append((char)_random.Next(char.MinValue, char.MaxValue));
}
byte[] x = Encoding.Unicode.GetBytes(plainText.ToString());
string result = Encoding.Unicode.GetString(x);
In theory, plainText and result should be identical. They're mostly the same, but some of the original characters are lost: characters in the 55000-57000 range seem to be replaced with character 65533.
I'm assuming the problem is with my encoding, but I thought Unicode would handle this properly. I've tried UTF8 and UTF32, but those give me the same problem.
Any thoughts?
The problem is that the characters in the range 0xD800-0xDFFF (55296-57343), called Unicode surrogate characters, are not valid on their own. They must appear as a pair (0xD800-0xDBFF first, 0xDC00-0xDFFF second) in order to be valid (in the UTF-16 encoding scheme). Alone, they will be treated as invalid characters and decoded to 0xFFFD (65533). C# uses UTF-16 to represent its strings, so that's why you are seeing that output.
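A quick way to see this for yourself (the class name here is just for illustration):

using System;
using System.Text;

class LoneSurrogateDemo
{
    static void Main()
    {
        string lone = "\uD800"; // a lone high surrogate, invalid by itself

        // The default encoder fallback substitutes U+FFFD for the
        // invalid sequence when encoding.
        byte[] bytes = Encoding.Unicode.GetBytes(lone);
        string decoded = Encoding.Unicode.GetString(bytes);

        Console.WriteLine((int)decoded[0]); // prints 65533 (0xFFFD)
    }
}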
You can either filter them out (e.g. calling _random.Next until you get a non-surrogate character), or generate legal surrogate pairs whenever you generate a surrogate character.
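A minimal sketch of the filtering approach, reusing your _random field (the method name is mine):

using System;
using System.Text;

class RandomStringSketch
{
    static readonly Random _random = new Random();

    // Rejection sampling: re-roll whenever the random char lands in the
    // surrogate range (0xD800-0xDFFF), so every char is valid on its own.
    static string NextNonSurrogateString(int stringLength)
    {
        var plainText = new StringBuilder();
        for (int j = 0; j < stringLength; ++j)
        {
            char c;
            do
            {
                c = (char)_random.Next(char.MinValue, char.MaxValue);
            } while (char.IsSurrogate(c));
            plainText.Append(c);
        }
        return plainText.ToString();
    }
}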
Those are surrogate characters, 55296-57343 (0xD800-0xDFFF). You need to pair them up correctly: a pair of surrogate characters in UTF-16 describes a single Unicode code point.
You seem to be operating on the assumption that a char and a code point are the same thing. That's not true: there are more than 2^16 code points (Unicode defines 0x110000 of them), so a single 16-bit char cannot represent them all.
I recommend reading the Wikipedia article on UTF-16.
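If you want to sample from the full set of code points rather than from chars, one sketch (the method name is mine) uses char.ConvertFromUtf32, which returns one char for values up to 0xFFFF and a correct surrogate pair for values above that:

using System;
using System.Text;

class CodePointSketch
{
    static readonly Random _random = new Random();

    // Samples whole code points instead of chars. Values in the
    // surrogate range are skipped, since char.ConvertFromUtf32
    // throws for them: they are not valid scalar values to encode.
    static string NextRandomCodePoints(int codePointCount)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < codePointCount; ++i)
        {
            int cp;
            do
            {
                cp = _random.Next(0, 0x110000); // U+0000 through U+10FFFF
            } while (cp >= 0xD800 && cp <= 0xDFFF);
            sb.Append(char.ConvertFromUtf32(cp));
        }
        return sb.ToString();
    }
}

Note that the resulting string may contain more chars than codePointCount, because each code point above 0xFFFF occupies two chars.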