 

What's the internal format of a .NET String?

I'm making some pretty string-manipulation-intensive code in C#.NET and got curious about some Joel Spolsky articles I remembered reading a while back:

http://www.joelonsoftware.com/articles/fog0000000319.html
http://www.joelonsoftware.com/articles/Unicode.html

So, how does .NET do it? Two bytes per char? There ARE some Unicode chars^H^H^H^H^H code points that need more than that. And how is the length encoded?

asked Jun 19 '09 by JCCyC

2 Answers

.NET uses UTF-16.

From System.String on MSDN:

"Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object."

answered Sep 28 '22 by Reed Copsey

The String object is rather complex, so the easiest way to look at it is through a short example: encode a given text into a string and inspect the resulting memory content as a sequence of byte values.

A String object represents text as a sequence of UTF-16 code units. It is a sequential collection of System.Char objects, each of which corresponds to one UTF-16 code unit. A single Char object usually represents a single code point, but a code point might require more than one encoded element, i.e. more than one Char object (supplementary code points are stored as surrogate pairs, and graphemes can consist of several code points). Note: UTF-16 is a variable-width encoding.
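
As a quick sketch of that last point (again my own example): a supplementary code point such as U+1F600 is stored as two Char objects, a surrogate pair, so the string's Length is 2 even though it represents a single code point.

string emoji = char.ConvertFromUtf32(0x1F600);              // one supplementary code point
System.Console.WriteLine(emoji.Length);                     // 2 — two UTF-16 code units (a surrogate pair)
System.Console.WriteLine(char.IsHighSurrogate(emoji[0]));   // True
System.Console.WriteLine(char.ConvertToUtf32(emoji, 0).ToString("X")); // 1F600 — the original code point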

The length of the string is stored in memory as a property of the String object. Note: a String object can include embedded null characters, which count as part of the string's length (as opposed to C and C++, where a null character marks the end of a string, so the length does not have to be stored separately). The internal character array holding the Char objects can actually be longer than the length of the string (a result of the allocation strategy).
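
A short sketch of the embedded-null behaviour (my own example, not from the answer):

string s = "abc\0def";               // contains an embedded null character
System.Console.WriteLine(s.Length);  // 7 — the null counts toward the length
                                     // (in C, strlen on the same bytes would report 3)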

If you are struggling to find the right encoding to work with (there is no property called System.Text.Encoding.UTF16), note that UTF-16 is exposed as System.Text.Encoding.Unicode, as used in this example:

string unicodeString = "pi stands for \u03a0";
byte[] encoded = System.Text.Encoding.Unicode.GetBytes(unicodeString);

Encoding.Unicode is a static property, not a constructor; it returns a UnicodeEncoding object that uses the little-endian byte order. The UnicodeEncoding class (which implements the UTF-16 encoding) can handle big-endian data as well, and also supports byte order marks. The native byte order of the Intel platform is little-endian, so it is probably more efficient for .NET (and Windows) to store Unicode strings in this format.
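
Building on the example above (the output shown is what I would expect with the default little-endian encoder): dumping the encoded bytes makes the byte order visible, since U+03A0 appears with its low byte first.

foreach (byte b in encoded)
    System.Console.Write("{0:X2} ", b);
// Expected output: 70 00 69 00 20 00 73 00 74 00 61 00 6E 00 64 00 73 00 20 00 66 00 6F 00 72 00 20 00 A0 03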

answered Sep 28 '22 by booFar