The text library uses UTF-16 internally. UTF-8 is a more commonly used encoding, especially in C libraries. In addition, UTF-8 uses less memory most of the time. Why does text use UTF-16?
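Which encoding is more compact depends on the characters being stored. UTF-16 is more efficient for code points from U+0800 to U+FFFF (which includes most CJK text): these take two bytes in UTF-16 but three in UTF-8. UTF-8 is more efficient for the ASCII range, U+0000 to U+007F: these take one byte in UTF-8 but two in UTF-16. For everything else the two tie: code points from U+0080 to U+07FF take two bytes in both, and code points above U+FFFF take four bytes in both.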
The main difference between the UTF-8, UTF-16, and UTF-32 character encodings is how many bytes each requires to represent a character in memory. UTF-8 uses between one and four bytes per character, while UTF-16 uses either two or four.
In UTF-8, characters within the ASCII range take only one byte, while very unusual characters (those outside the Basic Multilingual Plane) take four. UTF-32 uses four bytes per character regardless of what character it is, so it never uses less space than UTF-8 for the same string, and it uses more whenever the string contains any character that UTF-8 can encode in fewer than four bytes.
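To make the byte counts above concrete, here is a small sketch. It is written in Python rather than Haskell purely for illustration; the helper name encoded_sizes and the sample characters are my own choices and have nothing to do with the text library itself. The little-endian codec variants are used so that Python does not prepend a byte-order mark, which would otherwise inflate the UTF-16 and UTF-32 counts.

```python
# Byte cost of characters and strings in UTF-8, UTF-16, and UTF-32.
# The "-le" codecs are used so that no byte-order mark is counted.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_sizes(s):
    """Return the encoded length of s, in bytes, for each encoding."""
    return {enc: len(s.encode(enc)) for enc in ENCODINGS}


# One sample character from each size class discussed above.
samples = {
    "ASCII letter (U+0041)": "A",
    "accented Latin letter (U+00E9)": "\u00e9",
    "CJK ideograph (U+4E2D)": "\u4e2d",
    "emoji outside the BMP (U+1F600)": "\U0001F600",
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))

# Whole strings: ASCII text favours UTF-8, Japanese text favours UTF-16.
english = "hello world"                      # 11 ASCII characters
japanese = "\u3053\u3093\u306b\u3061\u306f"  # 5 hiragana characters
print(encoded_sizes(english))   # utf-8: 11, utf-16-le: 22, utf-32-le: 44
print(encoded_sizes(japanese))  # utf-8: 15, utf-16-le: 10, utf-32-le: 20
```

The per-character results are 1/2/4 bytes for the ASCII letter, 2/2/4 for the accented letter, 3/2/4 for the ideograph, and 4/4/4 for the emoji (UTF-8/UTF-16/UTF-32 in each case). So which encoding "uses less memory" really depends on what text you expect to store.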
As a content author or developer, you should nowadays generally choose UTF-8 as the character encoding for your content or data. It is a good choice because a single character encoding can handle any character you are likely to need, which greatly simplifies things.
There was a project to convert text to use UTF-8 internally, which is possible because the internal encoding is irrelevant to the API the library provides. Once enough of it was complete to benchmark, the UTF-8 version was judged not to be an improvement, and it has not been integrated into the mainline at this time. There is a chance it will be in the future, if it can be made a sufficient improvement. Here's the full story: http://jaspervdj.be/posts/2011-08-19-text-utf8-the-aftermath.html