Why does anyone use an encoding other than UTF-8? [closed]



Wikipedia lists advantages and disadvantages of UTF-8 as compared to a variety of other encodings:

http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages

The most important disadvantages, IMHO, are that UTF-8 can use significantly more space, especially for Asian languages such as Chinese, Japanese, or Hindi, and that code points vary in size, which makes length measurement harder and string operations such as indexing by character position inefficient.
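
A quick Python illustration of the variable-size point (any language that separates bytes from strings shows the same thing):

    s = "日本語"                    # 3 code points
    b = s.encode("utf-8")          # each takes 3 bytes in UTF-8
    print(len(s))                  # 3 (code points)
    print(len(b))                  # 9 (bytes)
    # Naive byte slicing can land mid-character and fail to decode:
    try:
        b[:4].decode("utf-8")
    except UnicodeDecodeError as e:
        print("byte slice split a character:", e)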


Well, some do it because their tools are archaic or flawed. Some do it because they don't see a need to support anything other than ASCII. Some do it because they don't know any better.

Those are the usual excuses for not using Unicode.

As for not using UTF-8 specifically, there are different reasons. Some systems, like Windows¹ (and, stemming from that, .NET) and Java, came to be at a time when Unicode was a strict 16-bit code. Therefore, there was really only one encoding: UCS-2, which encodes code points directly as 16-bit words.

Later Unicode was expanded to 21 bits because 65536 code points weren't enough anymore. This caused encodings such as UTF-32 and UTF-16 to appear. For systems previously working with UCS-2 the transition to UTF-16 was the easiest and most sensible choice. Windows did that transition back in Ye Olde Days of Windows 2000.
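
To make the difference concrete, here's a quick Python sketch (the same holds for Java's char and Windows's 16-bit WCHAR): a code point above U+FFFF no longer fits in one 16-bit unit and becomes a surrogate pair in UTF-16.

    # A code point outside the old 16-bit range (U+1F600, an emoji)
    cp = "\U0001F600"
    u16 = cp.encode("utf-16-be")          # big-endian, no BOM
    print(len(u16))                       # 4 bytes = two 16-bit code units
    print(u16.hex())                      # d83dde00: high surrogate D83D + low surrogate DE00
    # A BMP character still fits in a single 16-bit unit:
    print("A".encode("utf-16-be").hex())  # 0041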

So while I think that nearly all applications nowadays should support Unicode, I don't think it is entirely necessary for them to specifically use UTF-8. There are historic reasons for that, and no real benefit in converting existing systems from UTF-16 to UTF-8.


¹ Windows NT.


Code points between U+0800 and U+FFFF take up three bytes in UTF-8 but only two in UTF-16. See the Wikipedia comparison for more details, but basically, if text heavily uses code points in this range (say, if it's Chinese), UTF-8 files will be larger than UTF-16 files with the same content.
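
You can check the byte counts at each boundary with a few lines of Python (the ranges themselves come from the UTF-8 definition):

    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF):
        ch = chr(cp)
        print(f"U+{cp:04X}: {len(ch.encode('utf-8'))} bytes in UTF-8, "
              f"{len(ch.encode('utf-16-be'))} in UTF-16")
    # U+0000..U+007F: 1 byte, U+0080..U+07FF: 2 bytes,
    # U+0800..U+FFFF: 3 bytes -- versus a flat 2 bytes in UTF-16.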


UTF-8 is very efficient at encoding plain English text (one byte per character, the same as ASCII). If your user base is likely to be mostly, say, Chinese, you will be much better off using UTF-16.
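
For a rough sense of the size difference, here is a quick Python sketch (the sample sentences are arbitrary, and real documents full of ASCII markup shift the balance back toward UTF-8):

    english = "The quick brown fox jumps over the lazy dog."
    chinese = "敏捷的棕色狐狸跳过懒狗。"
    for name, text in (("English", english), ("Chinese", chinese)):
        u8 = len(text.encode("utf-8"))
        u16 = len(text.encode("utf-16-be"))
        print(f"{name}: {len(text)} chars, {u8} bytes in UTF-8, {u16} in UTF-16")
    # English: UTF-8 is half the size of UTF-16 (45 vs 90 bytes).
    # Chinese: UTF-8 is 1.5x the size of UTF-16 (36 vs 24 bytes).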

For more information, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.


Because outside the English-speaking world, people have for decades been using various encodings that predate Unicode and are tailored to their respective languages. These language-specific encodings have become ingrained everywhere and are pretty much a standard. If you want any hope of interfacing with legacy systems, you have to use them, so all systems have to support them, and usually use them as the default even if by now they support UTF-8 as well. There may even be multiple legacy encodings traditionally used for different purposes.

Examples:

  • ISO-8859-1 in Western Europe (actually outdated there as well, since you need ISO-8859-15 for the Euro sign)
  • ISO-2022-JP in Japan for emails, Shift JIS for websites
  • Big5 in Taiwan
  • GB2312 in China

The last two examples show that encodings can even be a political issue.
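
A small Python sketch of why interfacing matters: the same string produces entirely different byte sequences in each legacy encoding, and bytes decoded with the wrong table silently turn into mojibake rather than raising an error.

    text = "日本語"   # "Japanese (language)"
    for codec in ("utf-8", "iso2022_jp", "shift_jis"):
        print(codec, text.encode(codec).hex())
    # Decode Shift JIS bytes with a Western codepage and you get mojibake:
    print("日本語".encode("shift_jis").decode("cp1252"))   # “ú–{Œê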


Sometimes you're restricted to a legacy encoding for historical or tooling-support reasons (I'm developing on Windows using Zend Studio on a Samba share on a Linux box, and something in that mix means the editor keeps reverting to Cp1252 instead of UTF-8).

Sometimes you don't need UTF-8 (for example, when storing an MD5 hash in a database: you only need the hexadecimal characters 0-9 and A-F, so a plain ASCII or Latin-1 column suffices; in some databases, such as MySQL, a fixed-length CHAR column declared with a multi-byte character set reserves extra bytes per character for nothing).
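
The hex-only point is easy to verify (a quick Python sketch; the MySQL figure in the comment is the schema-level cost referred to above):

    import hashlib

    digest = hashlib.md5(b"some payload").hexdigest()
    print(digest)                        # 32 characters, all in 0-9 a-f
    print(digest.isascii())              # True
    print(len(digest.encode("utf-8")))   # 32 bytes, byte-for-byte identical to ASCII
    # The waste shows up at the schema level: in MySQL, for example, a CHAR(32)
    # column in the 3-byte utf8mb3 character set reserves 96 bytes per value,
    # while an ascii or latin1 CHAR(32) needs only 32.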

Sometimes it's just laziness about learning the UTF-8 handling functions for a particular language.