
What is the Best UTF [closed]

Tags: unicode, utf-8, utf

I'm really confused about the UTF encodings in Unicode.

There are UTF-8, UTF-16, and UTF-32.

My questions are:

  1. Which UTF supports all Unicode blocks?

  2. Which UTF is best (performance, size, etc.), and why?

  3. What are the differences between these three UTFs?

  4. What are endianness and byte order marks (BOMs)?

Thanks

Ahmad asked Jul 30 '11




2 Answers

Which UTF supports all Unicode blocks?

All UTF encodings support all Unicode blocks: every UTF encoding can represent every Unicode codepoint. However, some older, non-UTF encodings may not; UCS-2, for example, is like UTF-16 but lacks surrogate pairs, and thus cannot encode codepoints above 65535/U+FFFF.
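
This is easy to check in any language with correct codecs; here is a quick Python sanity check (illustrative only, not part of the original answer) that round-trips the highest codepoint through all three encodings:

    # Round-trip the highest Unicode codepoint through every UTF.
    ch = "\U0010FFFF"
    for enc in ("utf-8", "utf-16", "utf-32"):
        assert ch.encode(enc).decode(enc) == ch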

Which UTF is best (performance, size, etc.), and why?

For textual data that is mostly English and/or just ASCII, UTF-8 is by far the most space-efficient. However, UTF-8 is sometimes less space-efficient than UTF-16 and UTF-32 where most of the codepoints used are high (such as large bodies of CJK text).

What are the differences between these three UTFs?

UTF-8 encodes each Unicode codepoint in one to four bytes. Codepoints 0 to 127, which Unicode shares with ASCII, are encoded exactly as in ASCII, one byte each; bytes with values 128 to 255 appear only in multi-byte sequences.

UTF-16 encodes each Unicode codepoint in either two bytes (one UTF-16 value) or four bytes (two UTF-16 values). Anything in the Basic Multilingual Plane (Unicode codepoints 0 to 65535, or U+0000 to U+FFFF) is encoded with one UTF-16 value. Codepoints from the higher planes use two UTF-16 values, through a technique called 'surrogate pairs'.

UTF-32 is not a variable-length encoding; every Unicode codepoint is encoded directly as one 32-bit value, so U+10FFFF is simply encoded as 0x0010FFFF.
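
To make the three descriptions concrete, here is a small Python sketch (illustrative only, not the answerer's code) that dumps the bytes each encoding produces; the "-be" codec names are used so that no BOM is prepended:

    # Dump the bytes each UTF produces for a few sample codepoints.
    for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):  # A, é, 中, 😀
        print("U+%04X" % ord(ch),
              "utf-8:", ch.encode("utf-8").hex(),
              "utf-16:", ch.encode("utf-16-be").hex(),
              "utf-32:", ch.encode("utf-32-be").hex())
    # U+0041  utf-8: 41        utf-16: 0041      utf-32: 00000041
    # U+1F600 utf-8: f09f9880  utf-16: d83dde00  utf-32: 0001f600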

What are endianness and byte order marks (BOMs)?

Endianness is the order in which a CPU architecture, protocol, or file format stores the bytes of a multi-byte value. Little-endian systems (such as x86-32 and x86-64 CPUs) put the least-significant byte first, and big-endian systems (such as SPARC, the Motorola 68k, and many networking protocols) put the most-significant byte first.

In a little-endian encoding or system, the 32-bit value 0x12345678 is stored or transmitted as 0x78 0x56 0x34 0x12. In a big-endian encoding or system, it is stored or transmitted as 0x12 0x34 0x56 0x78.

A byte order mark is used in UTF-16 and UTF-32 to signal which endianness the text is to be interpreted as. Unicode does this in a clever way -- U+FEFF is a valid codepoint, used for the byte order mark, while U+FFFE is not. Therefore, if a file starts with 0xFF 0xFE, it can be assumed that the rest of the file is stored in a little-endian byte ordering.

A byte order mark in UTF-8 is technically possible, but says nothing about endianness, since UTF-8's code units are single bytes. However, a stream that begins with the UTF-8-encoded BOM (0xEF 0xBB 0xBF) is almost certainly UTF-8, so the BOM can still serve as an encoding signature.
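
As a rough sketch (mine, not the answerer's), BOM sniffing can be done with the constants that ship in Python's standard codecs module. Note that UTF-32 must be tested before UTF-16, because the UTF-32-LE BOM (ff fe 00 00) begins with the UTF-16-LE BOM (ff fe):

    import codecs

    def sniff_encoding(raw):
        """Guess a Unicode encoding from a leading byte order mark."""
        if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            return "utf-32"
        if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        if raw.startswith(codecs.BOM_UTF8):
            return "utf-8"
        return "unknown (no BOM; could still be BOM-less UTF-8)"

    # Python's "utf-16" codec writes a native-order BOM, so this prints utf-16.
    print(sniff_encoding("hi".encode("utf-16")))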

Benefits of UTF-8

  • ASCII is a subset of the UTF-8 encoding and therefore is a great way to introduce ASCII text into a 'Unicode world' without having to do data conversion
  • UTF-8 text is the most compact format for ASCII text
  • Valid UTF-8 can be sorted on raw byte values and yield the codepoints in sorted order (demonstrated in the sketch after these lists)

Benefits of UTF-16

  • UTF-16 is easier than UTF-8 to decode, even though it is a variable-length encoding
  • UTF-16 is more space-efficient than UTF-8 for characters in the BMP, but outside ASCII

Benefits of UTF-32

  • UTF-32 is not variable-length, so it requires no special logic to decode
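
The size and sorting claims above are easy to demonstrate. In this sketch (illustrative only, with made-up sample strings), byte counts use the BOM-less "-be" codecs:

    # Byte counts for the same text under each encoding.
    samples = {
        "ASCII": "hello world",
        "CJK (BMP, non-ASCII)": "\u4e2d\u6587\u7f16\u7801",  # 中文编码
        "emoji (outside the BMP)": "\U0001F600" * 4,
    }
    for label, text in samples.items():
        print(label, {e: len(text.encode(e))
                      for e in ("utf-8", "utf-16-be", "utf-32-be")})

    # Sorting valid UTF-8 by raw bytes matches sorting by codepoint.
    words = ["zebra", "\u00e9clair", "apple", "\u4e2d"]
    by_bytes = [b.decode("utf-8") for b in sorted(w.encode("utf-8") for w in words)]
    assert by_bytes == sorted(words)
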
Delan Azabani answered Dec 23 '22


“Answer me these questions four, as all were answered long before.”

You really should have asked one question, not four. But here are the answers.

  1. All UTF transforms by definition support all Unicode code points. That is something you needn’t worry about. The only problem is that some systems are really UCS-2 yet claim they are UTF-16, and UCS-2 is severely broken in several fundamental ways:

    • UCS-2 is not a valid Unicode encoding.
    • UCS-2 supports only ¹⁄₁₇ᵗʰ of Unicode. That is, Plane 0 only, not Planes 1–16.
    • UCS-2 permits code points that The Unicode Standard guarantees will never be in a valid Unicode stream. These include
      • all 2,048 UTF-16 surrogates, code points U+D800 through U+DFFF (the pair arithmetic is sketched below, after item 4)
      • the 32 non-character code points between U+FDD0 and U+FDEF
      • both sentinels at U+FFFE and U+FFFF

    For the encodings used internally by seven different programming languages, see slide 7, Feature Support Summary, in my OSCON talk from last week, “Unicode Support Shootout”. It varies a great deal.

  2. UTF-8 is the best serialization transform of a stream of logical Unicode code points because, in no particular order:

    • UTF-8 is the de facto standard Unicode encoding on the web.
    • UTF-8 can be stored in a null-terminated string.
    • UTF-8 is free of the vexing BOM issue.
    • UTF-8 risks no confusion of UCS-2 vs UTF-16.
    • UTF-8 compacts mainly-ASCII text quite efficiently, so that even Asian texts that are in XML or HTML often wind up being smaller in bytes than UTF-16. This is an important thing to know, because it is a counterintuitive and surprising result. The ASCII markup tags often make up for the extra byte. If you are really worried about storage, you should be using proper text compression, like LZW and related algorithms. Just bzip it.
    • If need be, it can be roped into use for trans-Unicodian points of arbitrarily large magnitude. For example, MAXINT on a 64-bit machine becomes 13 bytes using the original UTF-8 algorithm. This property is of rare usefulness, though, and must be used with great caution lest it be mistaken for a legitimate UTF-8 stream.

    I use UTF-8 whenever I can get away with it.

  3. I have already given properties of UTF-8, so here are some for the other two:

    • UTF-32 enjoys a singular advantage for internal storage: O(1) access to code point N. That is, constant time access when you need random access. Remember we lived forever with O(N) access in C’s strlen function, so I am not sure how important this is. My impression is that we almost always process our strings in sequential not random order, in which case this ceases to be a concern. Yes, it takes more memory, but only marginally so in the long run.
    • UTF-16 is a terrible format, having all the disadvantages of UTF-8 and UTF-32 but none of the advantages of either. It is grudgingly true that when properly handled, UTF-16 can certainly be made to work, but doing so takes real effort, and your language may not be there to help you. Indeed, your language is probably going to work against you instead. I’ve worked with UTF-16 enough to know what a royal pain it is. I would stay clear of both these, especially UTF-16, if you possibly have any choice in the matter. The language support is almost never there, because there are massive pods of hysterical porpoises all contending for attention. Even when proper code-point instead of code-unit access mechanisms exist, these are usually awkward to use and lengthy to type, and they are not the default. This leads too easily to bugs that you may not catch until deployment; trust me on this one, because I’ve been there.

    That’s why I’ve come to talk about there being a UTF-16 Curse. The only thing worse than The UTF-16 Curse is The UCS-2 Curse.

  4. Endianness and the whole BOM thing are problems that curse both UTF-16 and UTF-32 alike. If you use UTF-8, you will not ever have to worry about these.
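
As promised in item 1, here is a minimal sketch of the surrogate-pair arithmetic (an illustration added here, not tchrist's code; the constants come straight from the UTF-16 definition):

    # Split a codepoint above the BMP into a UTF-16 surrogate pair.
    def to_surrogate_pair(cp):
        assert 0x10000 <= cp <= 0x10FFFF, "only codepoints above U+FFFF need a pair"
        offset = cp - 0x10000            # 20 bits of payload
        high = 0xD800 + (offset >> 10)   # leading (high) surrogate
        low = 0xDC00 + (offset & 0x3FF)  # trailing (low) surrogate
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']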

I sure do hope that you are using logical (that is, abstract) code points internally with all your APIs, and worrying about serialization only for external interchange. Anything that makes you get at code units instead of code points is far, far more hassle than it’s worth, no matter whether those code units are 8 bits wide or 16 bits wide. You want a code-point interface, not a code-unit interface. Once your API uses code points instead of code units, the actual underlying representation no longer matters. It is important that this be hidden.
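
Python 3 illustrates the difference: its str type is a code-point interface, while the encoded bytes expose code units (a small demonstration, with made-up sample text):

    s = "a\U0001F600b"                      # 'a', an emoji, 'b'
    print(len(s))                           # 3 code points
    print(len(s.encode("utf-16-be")) // 2)  # 4 UTF-16 code units (surrogate pair)
    print(len(s.encode("utf-8")))           # 6 UTF-8 code units (bytes)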


Category Errors

Let me add that everyone talking about ASCII versus Unicode is making a category error. Unicode is very much NOT “like ASCII but with more characters.” That might describe ISO 10646, but it does not describe Unicode. Unicode is not merely a particular repertoire of characters but also the rules for handling them. Not just more characters, but more characters with particular rules accompanying them. Unicode characters without Unicode rules are no longer Unicode characters.

If you use an ASCII mindset to handle Unicode text, you will get all kinds of brokenness, again and again. It doesn’t work. As just one example of this, it is because of this misunderstanding that the Python pattern-matching library, re, does the wrong thing completely when matching case-insensitively. It blindly assumes two code points count as the same if both have the same lowercase. That is an ASCII mindset, which is why it fails. You just cannot treat Unicode that way, because if you do you break the rules and it is no longer Unicode. It’s just a mess.

For example, Unicode defines U+03C3 GREEK SMALL LETTER SIGMA and U+03C2 GREEK SMALL LETTER FINAL SIGMA as case-insensitive versions of each other. (This is called Unicode casefolding.) But since neither changes when blindly mapped to lowercase, that comparison fails. You just can’t do it that way, and you can’t fix it in the general case by switching the lowercase comparison to an uppercase one, either. Using casemapping when you need casefolding betrays a shaky understanding of the whole works.
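
The sigma example is easy to verify in Python 3, where str.casefold() (available since 3.3) implements Unicode casefolding while str.lower() is mere casemapping (a sketch added here, not tchrist's code):

    sigma = "\u03c3"  # σ GREEK SMALL LETTER SIGMA
    final = "\u03c2"  # ς GREEK SMALL LETTER FINAL SIGMA
    print(sigma.lower() == final.lower())        # False: casemapping fails
    print(sigma.casefold() == final.casefold())  # True: casefolding succeeds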

(And that’s nothing: Python 2 is broken even worse. I recommend against using Python 2 for Unicode; use Python 3 if you want to do Unicode in Python. For Pythonistas, the solution I recommend for Python’s innumerably many Unicode regex issues is Matthew Barnett’s marvelous regex library for Python 2 and Python 3. It is really quite neat, and it actually gets Unicode casefolding right — amongst many other Unicode things that the standard re gets miserably wrong.)

REMEMBER: Unicode is not just more characters: Unicode is rules for handling more characters. One either learns to work with Unicode, or else one works against it, and if one works against it, then it works against you.

tchrist answered Dec 23 '22