Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What does "The .NET framework uses the UTF-16 encoding standard by default" mean?

My study guide (for 70-536 exam) says this twice in the text and encoding chapter, which is right after the IO chapter.

All the examples so far are to do with simple file access using FileStream and StreamWriter.

It aslo says stuff like "If you don't know what encoding to use when you create a file, don't specify one and .NET will use UTF16" and "Specify different encodings using Stream constructor overloads".

Never mind the fact that the actual overloads are on the StreamWriter class but hey, whatever.

I am looking at StreamWriter right now in reflector and I am certain I can see that the default is actaully UTF8NoBOM.

But none of this is listed in the errata. It's an old book (cheked the errat of both editions) so if it was wrong I would have thought someone had picked up on it.....

Makes me think maybe I didn't understand it.

So.....any ideas what it is talking about? Some other place where there is a default?

It's just totally confused me.

like image 902
J M Avatar asked Mar 23 '09 23:03

J M


People also ask

What is UTF-16 encoding?

UTF-16 is an encoding of Unicode in which each character is composed of either one or two 16-bit elements. Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts.

What is the purpose of UTF encoding?

UTF-8 is currently the most popular encoding method on the internet because it can efficiently store text containing any character. UTF-16 is another encoding method, but is less efficient for storing text files (except for those written in certain non-English languages).

Why is UTF-16 needed?

UTF-16 allows all of the basic multilingual plane (BMP) to be represented as single code units. Unicode code points beyond U+FFFF are represented by surrogate pairs. The interesting thing is that Java and Windows (and other systems that use UTF-16) all operate at the code unit level, not the Unicode code point level.

What is the default encoding of strings in net?

All string functions in Windows use UTF-16 and have for years.


1 Answers

“UTF-16” is an annoying term, as it has two meanings which are easily confused.

The first meaning is a series of 16-bit codepoints. Most of these correspond directly to the Unicode character of the same number; characters outside the Basic Multilingual Plane (U+10000 upwards) are stored as two 16-bit codepoints, each one of the Surrogates.

Many languages use UTF-16 in this sense for internal storage purposes, including as a native string type. This is the usual source of phrases like “.NET (or Java) uses UTF-16 as its default encoding”. .NET is accessing the elements of such a UTF-16 string 16 bits at a time (ie, at the implementation level, as a uint16).

The next thing to consider is the encoding of such a UTF-16 string into linear bytes, for storage in a file or network stream. As always when you store larger numbers into bytes, there are two possible encodings: little-endian or big-endian. So you can use “UTF-16LE”, the little-endian encoding of UTF-16 into bytes, or “UTF-16BE”, the big-endian encoding.

(“UTF-16LE” is the more commonly used. Just to add more confusion to the flames, Windows gives it the deeply misleading and ambiguous encoding name “Unicode”. In reality it is almost always better to use UTF-8 for file storage and network streams than either of UTF-16LE/BE.)

But if you don't know whether a bunch of bytes contains “UTF-16LE” or “UTF-16BE”, you can use the trick of looking at the first code point to work it out. This code point, the Byte Order Mark (BOM), is only valid when read one way around, so you can't mistake one encoding for the other.

This approach, of not caring what byte order you have but using a BOM to signal it, is usually referred to under the encoding name... “UTF-16”.

So, when someone says “UTF-16”, you can't tell whether they mean a sequence of short-int Unicode code points, or a sequence of bytes in unspecified order that will decode to one.

(“UTF-32” has the same problem.)

If you don't know what encoding to use when you create a file, don't specify one and .NET will use UTF16

If that's the actual direct quote it is a lie. Constructing a StreamWriter without an encoding argument is explicitly specified to give you UTF-8.

like image 146
bobince Avatar answered Oct 05 '22 17:10

bobince