From here (broken link removed):
Essentially, string uses the UTF-16 character encoding form.
But when saving via StreamWriter:
This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM).
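To see both defaults in action, here is a minimal sketch (the file name demo.txt and the sample text are arbitrary choices for illustration):

```csharp
using System;
using System.IO;
using System.Text;

class StreamWriterEncodingDemo
{
    static void Main()
    {
        string path = "demo.txt"; // illustrative scratch file

        // Default constructor: UTF-8, and no byte-order mark is written.
        using (var writer = new StreamWriter(path))
        {
            writer.Write("héllo");
        }
        byte[] utf8Bytes = File.ReadAllBytes(path);
        Console.WriteLine(utf8Bytes.Length);      // 6 -- 'h','l','l','o' take 1 byte each, 'é' takes 2
        Console.WriteLine(utf8Bytes[0] == 0xEF);  // False -- no EF BB BF BOM at the start

        // Requesting UTF-16 explicitly: Encoding.Unicode is little-endian UTF-16 with a BOM.
        using (var writer = new StreamWriter(path, false, Encoding.Unicode))
        {
            writer.Write("héllo");
        }
        byte[] utf16Bytes = File.ReadAllBytes(path);
        Console.WriteLine(utf16Bytes.Length);     // 12 -- 2-byte BOM + 5 chars * 2 bytes each
    }
}
```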
I've also seen this sample (broken link removed), and it looks like UTF-8 is smaller for some strings while UTF-16 is smaller for others.
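That observation is easy to reproduce with Encoding.GetByteCount; the sample strings here are arbitrary:

```csharp
using System;
using System.Text;

class ByteCountComparison
{
    static void Main()
    {
        // Mostly-ASCII text: UTF-8 needs 1 byte per character, UTF-16 needs 2.
        string english = "The quick brown fox";
        Console.WriteLine(Encoding.UTF8.GetByteCount(english));    // 19
        Console.WriteLine(Encoding.Unicode.GetByteCount(english)); // 38

        // CJK text: UTF-8 needs 3 bytes per character here, UTF-16 only 2.
        string chinese = "你好，世界";
        Console.WriteLine(Encoding.UTF8.GetByteCount(chinese));    // 15
        Console.WriteLine(Encoding.Unicode.GetByteCount(chinese)); // 10
    }
}
```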
So why does .NET use UTF-16 as the default encoding for string but UTF-8 for saving files? Thank you.
P.S. I've already read the famous article.
Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 a character starts at 16 bits. The main UTF-8 advantage: basic ASCII characters such as digits and unaccented Latin letters occupy only one byte each, identical to their US-ASCII representation.
.NET provides encoding classes that encode and decode text by using various encoding systems. For example, the UTF8Encoding class describes the rules for encoding to, and decoding from, UTF-8. .NET uses UTF-16 encoding (represented by the UnicodeEncoding class) for string instances.
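A minimal round trip with those two classes (the literal "café" is just an example string) shows that the bytes differ but the in-memory string is UTF-16 either way:

```csharp
using System;
using System.Text;

class EncodingRoundTrip
{
    static void Main()
    {
        string s = "café"; // held in memory as UTF-16 code units, whatever we do below

        // The encoding classes only describe how to convert to and from bytes.
        byte[] utf8Bytes  = new UTF8Encoding().GetBytes(s);    // 5 bytes: 'é' needs two
        byte[] utf16Bytes = new UnicodeEncoding().GetBytes(s); // 8 bytes: 2 per code unit

        // Decoding either byte sequence yields the same UTF-16 string again.
        Console.WriteLine(new UTF8Encoding().GetString(utf8Bytes));     // café
        Console.WriteLine(new UnicodeEncoding().GetString(utf16Bytes)); // café
    }
}
```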
If your data is mostly in western languages and you want to reduce the amount of storage needed, go with UTF-8, as for those languages it will take about half the storage of UTF-16.
Why did UTF-8 replace the ASCII character-encoding standard? Because UTF-8 can store a character in more than a single byte, it can represent many more characters, such as emoji, while remaining byte-compatible with plain ASCII text.
If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.
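A short sketch of that distinction, using the euro sign as a BMP character and an emoji as a supplementary-plane character:

```csharp
using System;

class SurrogatePairDemo
{
    static void Main()
    {
        // A BMP character fits in a single UTF-16 code unit, i.e. one char.
        string euro = "\u20AC"; // €
        Console.WriteLine(euro.Length); // 1

        // A character outside the BMP needs a surrogate pair: two chars.
        string emoji = "\U0001F600"; // 😀 (U+1F600)
        Console.WriteLine(emoji.Length);                    // 2
        Console.WriteLine(char.IsHighSurrogate(emoji[0]));  // True
        Console.WriteLine(char.IsLowSurrogate(emoji[1]));   // True

        // Recombining the pair gives back the real code point.
        Console.WriteLine(char.ConvertToUtf32(emoji[0], emoji[1])); // 128512 == 0x1F600
    }
}
```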
Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32! Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.
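Incidentally, modern .NET (Core 3.0 and later) ships something close to that split: System.Text.Rune is a 32-bit Unicode scalar value that can be enumerated over a UTF-16 string. A rough sketch, assuming such a runtime:

```csharp
using System;
using System.Text;

class RuneDemo
{
    static void Main()
    {
        // The string itself stays UTF-16...
        string s = "a😀";
        Console.WriteLine(s.Length); // 3 UTF-16 code units

        // ...but each Rune is a full 32-bit Unicode scalar value.
        foreach (Rune r in s.EnumerateRunes())
        {
            Console.WriteLine($"U+{r.Value:X4} utf16 units: {r.Utf16SequenceLength}, utf8 bytes: {r.Utf8SequenceLength}");
        }
        // U+0061 utf16 units: 1, utf8 bytes: 1
        // U+1F600 utf16 units: 2, utf8 bytes: 4
    }
}
```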
The two disadvantages of UTF-16 are that the number of code units per character is still variable (anything outside the BMP needs a surrogate pair), and that plain ASCII text, which covers a lot of western-language data, takes twice the space of the equivalent UTF-8.
(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)
Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
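A small illustration of the normalization point, writing the same accented letter in two equivalent ways:

```csharp
using System;
using System.Globalization;
using System.Text;

class NthCharacterIssues
{
    static void Main()
    {
        // Two equivalent spellings of "é": one precomposed code point,
        // or 'e' followed by U+0301 COMBINING ACUTE ACCENT.
        string composed = "\u00E9";
        string decomposed = "e\u0301";

        Console.WriteLine(composed == decomposed); // False
        Console.WriteLine(composed.Length);        // 1
        Console.WriteLine(decomposed.Length);      // 2

        // Normalizing to NFC makes the ordinal comparison succeed.
        Console.WriteLine(composed == decomposed.Normalize(NormalizationForm.FormC)); // True

        // The "character" a user perceives is really a text element (grapheme cluster).
        Console.WriteLine(new StringInfo(decomposed).LengthInTextElements); // 1
    }
}
```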
As with many "why was this chosen" questions, this was determined by history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode still had a code space of only 65,536 codepoints, what is now called the Basic Multilingual Plane. It wasn't until 1996 that Unicode acquired the supplementary planes, extending the coding space to just over a million codepoints, along with surrogate pairs to fit them into a 16-bit encoding, thus setting the UTF-16 standard.
.NET strings are UTF-16 because that is an excellent fit with the operating system's encoding; no conversion is required.
The history of UTF-8 is murkier. Its standardization definitely came after Windows NT (RFC 3629 dates from November 2003), and it took a while to gain a foothold; the Internet was instrumental.