From here (broken link removed):
Essentially, string uses the UTF-16 character encoding form.
But when saving via StreamWriter:
This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM).
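To see both defaults in action, here is a minimal sketch (the file name demo.txt and the sample text are arbitrary choices for illustration):

```csharp
using System;
using System.IO;
using System.Text;

class StreamWriterEncodingDemo
{
    static void Main()
    {
        string path = "demo.txt"; // illustrative scratch file

        // Default constructor: UTF-8, and no byte-order mark is written.
        using (var writer = new StreamWriter(path))
        {
            writer.Write("héllo");
        }
        byte[] utf8Bytes = File.ReadAllBytes(path);
        Console.WriteLine(utf8Bytes.Length);      // 6 -- 'h','l','l','o' take 1 byte each, 'é' takes 2
        Console.WriteLine(utf8Bytes[0] == 0xEF);  // False -- no EF BB BF BOM at the start

        // Requesting UTF-16 explicitly: Encoding.Unicode is little-endian UTF-16 with a BOM.
        using (var writer = new StreamWriter(path, false, Encoding.Unicode))
        {
            writer.Write("héllo");
        }
        byte[] utf16Bytes = File.ReadAllBytes(path);
        Console.WriteLine(utf16Bytes.Length);     // 12 -- 2-byte BOM + 5 chars * 2 bytes each
    }
}
```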
I've also seen this sample (broken link removed), and it looks like UTF-8 is smaller for some strings while UTF-16 is smaller for others.
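That observation is easy to reproduce with Encoding.GetByteCount; the sample strings here are arbitrary:

```csharp
using System;
using System.Text;

class ByteCountComparison
{
    static void Main()
    {
        // Mostly-ASCII text: UTF-8 needs 1 byte per character, UTF-16 needs 2.
        string english = "The quick brown fox";
        Console.WriteLine(Encoding.UTF8.GetByteCount(english));    // 19
        Console.WriteLine(Encoding.Unicode.GetByteCount(english)); // 38

        // CJK text: UTF-8 needs 3 bytes per character here, UTF-16 only 2.
        string chinese = "你好，世界";
        Console.WriteLine(Encoding.UTF8.GetByteCount(chinese));    // 15
        Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 10
    }
}
```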
So why does .NET use UTF-16 as the default encoding for string but UTF-8 for saving files? Thank you.
P.S. I've already read the famous article.
Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 a character starts at 16 bits. The main UTF-8 advantage: basic ASCII characters such as digits and unaccented Latin letters occupy only one byte each, identical to their US-ASCII representation.
.NET provides encoding classes that encode and decode text by using various encoding systems. For example, the UTF8Encoding class describes the rules for encoding to, and decoding from, UTF-8. .NET uses UTF-16 encoding (represented by the UnicodeEncoding class) for string instances.
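A minimal round trip with those two classes (the literal "café" is just an example string) shows that the bytes differ but the in-memory string is UTF-16 either way:

```csharp
using System;
using System.Text;

class EncodingRoundTrip
{
    static void Main()
    {
        string s = "café"; // held in memory as UTF-16 code units, whatever we do below

        // The encoding classes only describe how to convert to and from bytes.
        byte[] utf8Bytes  = new UTF8Encoding().GetBytes(s);    // 5 bytes: 'é' needs two
        byte[] utf16Bytes = new UnicodeEncoding().GetBytes(s); // 8 bytes: 2 per code unit

        // Decoding either byte sequence yields the same UTF-16 string again.
        Console.WriteLine(new UTF8Encoding().GetString(utf8Bytes));     // café
        Console.WriteLine(new UnicodeEncoding().GetString(utf16Bytes)); // café
    }
}
```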
If your data is mostly in western languages and you want to reduce the amount of storage needed, go with UTF-8, as for those languages it will take about half the storage of UTF-16.
Why did UTF-8 replace the ASCII character-encoding standard? Because UTF-8 can store a character in more than a single byte, it can represent many more characters, such as emoji, while remaining byte-compatible with plain ASCII text.
If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.
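A short sketch of that distinction, using the euro sign as a BMP character and an emoji as a supplementary-plane character:

```csharp
using System;

class SurrogatePairDemo
{
    static void Main()
    {
        // A BMP character fits in a single UTF-16 code unit, i.e. one char.
        string euro = "\u20AC"; // €
        Console.WriteLine(euro.Length); // 1

        // A character outside the BMP needs a surrogate pair: two chars.
        string emoji = "\U0001F600"; // 😀 (U+1F600)
        Console.WriteLine(emoji.Length);                    // 2
        Console.WriteLine(char.IsHighSurrogate(emoji[0]));  // True
        Console.WriteLine(char.IsLowSurrogate(emoji[1]));   // True

        // Recombining the pair gives back the real code point.
        Console.WriteLine(char.ConvertToUtf32(emoji[0], emoji[1])); // 128512 == 0x1F600
    }
}
```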
Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32! Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.
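Incidentally, modern .NET (Core 3.0 and later) ships something close to that split: System.Text.Rune is a 32-bit Unicode scalar value that can be enumerated over a UTF-16 string. A rough sketch, assuming such a runtime:

```csharp
using System;
using System.Text;

class RuneDemo
{
    static void Main()
    {
        // The string itself stays UTF-16...
        string s = "a😀";
        Console.WriteLine(s.Length); // 3 UTF-16 code units

        // ...but each Rune is a full 32-bit Unicode scalar value.
        foreach (Rune r in s.EnumerateRunes())
        {
            Console.WriteLine($"U+{r.Value:X4} utf16 units: {r.Utf16SequenceLength}, utf8 bytes: {r.Utf8SequenceLength}");
        }
        // U+0061 utf16 units: 1, utf8 bytes: 1
        // U+1F600 utf16 units: 2, utf8 bytes: 4
    }
}
```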
The two disadvantages of UTF-16 are that the number of code units per character is still variable (anything outside the BMP needs a surrogate pair), and that plain ASCII text, which covers a lot of western-language data, takes twice the space of the equivalent UTF-8.
(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)
Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
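A small illustration of the normalization point, writing the same accented letter in two equivalent ways:

```csharp
using System;
using System.Globalization;
using System.Text;

class NthCharacterIssues
{
    static void Main()
    {
        // Two equivalent spellings of "é": one precomposed code point,
        // or 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        string composed = "\u00E9";
        string decomposed = "e\u0301";

        Console.WriteLine(composed == decomposed); // False
        Console.WriteLine(composed.Length);        // 1
        Console.WriteLine(decomposed.Length);      // 2

        // Normalizing to NFC makes the ordinal comparison succeed.
        Console.WriteLine(composed == decomposed.Normalize(NormalizationForm.FormC)); // True

        // The "character" a user perceives is really a text element (grapheme cluster).
        Console.WriteLine(new StringInfo(decomposed).LengthInTextElements); // 1
    }
}
```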
As with many "why was this chosen" questions, this was determined by history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode still had a code space of only 65,536 codepoints, what is now called the Basic Multilingual Plane. It wasn't until 1996 that Unicode acquired the supplementary planes, extending the coding space to just over a million codepoints, along with surrogate pairs to fit them into a 16-bit encoding, thus setting the UTF-16 standard.
.NET strings are UTF-16 because that is an excellent fit with the operating system's encoding; no conversion is required.
The history of UTF-8 is murkier. Its standardization definitely came after Windows NT (RFC 3629 dates from November 2003), and it took a while to gain a foothold; the Internet was instrumental.