Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Utf8_general_ci or utf8mb4 or...?

utf16 or utf32? I'm trying to store content in a lot of languages. Some of the languages use double-wide fonts (for example, Japanese fonts are frequently twice as wide as English fonts). I'm not sure which kind of database I should be using. Any information about the differences between these four charsets...

like image 552
Wolfpack'08 Avatar asked Jul 18 '12 02:07

Wolfpack'08


People also ask

Should I use utf8mb4 or utf8?

The difference between utf8 and utf8mb4 is that the former can only store 3 byte characters, while the latter can store 4 byte characters. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character.

What's the difference between utf8_general_ci and utf8_unicode_ci?

In short: utf8_unicode_ci uses the Unicode Collation Algorithm as defined in the Unicode standards, whereas utf8_general_ci is a more simple sort order which results in "less accurate" sorting results. If you don't care about correctness, then it's trivial to make any algorithm infinitely fast.

Should I use utf8mb4?

To save 4-byte-long UTF-8 characters in Mysql, you need to use the UTF8MB4 character set, but only 5.5. After 3 versions are supported (View version: Select version ();). I think that in order to get better compatibility, you should always use UTF8MB4 instead of UTF8.

What is difference between utf8mb3 and utf8mb4?

utf8mb3 supports only characters in the Basic Multilingual Plane (BMP). utf8mb4 additionally supports supplementary characters that lie outside the BMP. utf8mb3 uses a maximum of three bytes per character. utf8mb4 uses a maximum of four bytes per character.


1 Answers

MySQL's utf32 and utf8mb4 (as well as standard UTF-8) can directly store any character specified by Unicode; the former is fixed size at 4 bytes per character whereas the latter is between 1 and 4 bytes per character.

utf8mb3 and the original utf8 can only store the first 65,536 codepoints, which will cover CJVK (Chinese, Japanese, Vietnam, Korean), and use 1 to 3 bytes per character.

utf16 uses 2 bytes for the first 65,536 codepoints, and 4 bytes for everything else.

As for fonts, that's strictly a visual thing.

"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

See also MySQL documentation for Unicode support.

like image 139
Ignacio Vazquez-Abrams Avatar answered Sep 23 '22 02:09

Ignacio Vazquez-Abrams