Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does using ASCII/Latin Charset speed up the database?

It would seem that using the ASCII charset for most fields and then specify utf8 only for the fields that need it would reduce the amount of I/O the database must perform by 100%.

Anyone know if this is true?

Update: The above was not really my question. I should have said: use Latin for the default character set and then only specify utf8mb4 only for the fields that need it. The thinking being that: using 1 byte vs 2 bytes should improve I/O by 100%. Sorry for the confusion.

like image 809
mbalsam Avatar asked Jul 23 '18 23:07

mbalsam


People also ask

Which is the best collation for MySQL?

If you're using MySQL 5.7, the default MySQL collation is generally latin1_swedish_ci because MySQL uses latin1 as its default character set. If you're using MySQL 8.0, the default charset is utf8mb4. If you elect to use UTF-8 as your collation, always use utf8mb4 (specifically utf8mb4_unicode_ci).

What is the difference between UTF-8 and latin1?

what is the difference between utf8 and latin1? They are different encodings (with some characters mapped to common byte sequences, e.g. the ASCII characters and many accented letters). UTF-8 is one encoding of Unicode with all its codepoints; Latin1 encodes less than 256 characters.

What is the difference between UTF-8 and utf8mb4?

The difference between utf8 and utf8mb4 is that the former can only store 3 byte characters, while the latter can store 4 byte characters. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character.

What is the default collation for MySQL?

The default MySQL server character set and collation are latin1 and latin1_swedish_ci , but you can specify character sets at the server, database, table, column, and string literal levels.


2 Answers

Short Answer: Not worth worrying about.

Long Answer:

Two issues:

  • Speed:

Comparing two encodings with the corresponding _bin (ascii_bin or utf8_bin) COLLATION is as simple as comparing the bytes -- so no significant difference. Other collations can differ, with ascii being faster. But the difference is insignificant compared to the effort of fetching rows, etc.

  • Space:

Ascii is a subset of utf8. utf8 stores only 1 byte for each ascii character, just as ascii does. So, no space difference. (Accented letters in Western Europe need either 1-byte latin1 or 2-byte utf8; hence incompatible and different in size.) Space leads to caching, which leads to a slight difference in performance.

For English text, 0% savings. For European, latin1 would save only a few percent; For most the rest of the world, utf8 are the only viable solution. For Chinese and Emoji, utf8mb4 is a must.

  • Temp tables

In certain situations, the space consumed by a string expands to the potential max. country_code CHAR(2) CHARACTER SET ... will take 2 bytes for ascii; 6 bytes for utf8.

Bottom Line:

Use ascii for country codes, hex, postal codes, uuids, md5s etc. If you are going international, and/or need Emoji, then make your "strings" utf8mb4. But do it because it is 'right', not because you will get magically marvelously much more speed; you won't. And do it whenever you create a table; it's the pits to change it later.

like image 181
Rick James Avatar answered Nov 22 '22 23:11

Rick James


@RickJames is right, you should not worry about saving space by choosing ASCII or utf8 over utf8mb4.

utf8 and utf8mb4 are variable-length character encodings. This table from wikipedia illustrates how characters automatically take 1, 2, 3, or 4 bytes each, depending on the value encoded. If the high bit of a byte is set, then the character uses an additional byte, up to 4 bytes.

enter image description here The wikipedia article explains it clearly:

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

You don't have to do anything to choose single-byte versus multi-byte mode. This is just the way the encoding works. Each character automatically uses the number of bytes it needs, and no more.

So there is no advantage to using utf8 over utf8mb4, and no advantage of using ASCII over either, unless you need to restrict the characters allowed in a string.

For what it's worth, the character set MySQL calls "utf8" is an alias for utf8mb3, an implementation of just the first three bytes of the UTF8 encoding. The MySQL server team blog (https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/) says that utf8mb4 is faster, at least given performance improvements in MySQL 8.0, and utf8mb3 should be considered deprecated. MySQL 8.0.11 release notes say that utf8 will be redefined as an alias for utf8mb4 in some future version of MySQL.

like image 23
Bill Karwin Avatar answered Nov 22 '22 23:11

Bill Karwin