Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Suggested character set for non utf8 columns in mysql

Tags:

guid

mysql

utf-8

Currently I'm using VARCHAR/TEXT with utf8_general_ci for all character columns in mysql. Now I want to improve database layout/performance.

What I figured out so far is to better use

  • CHAR instead of VARCHAR for fixed length columns as GUIDs or session ids
  • also use CHAR for small columns having length of 1 or maybe 2?

As I do not want to go as wide to save my GUIDs as BINARY(16) because of handling issues, I'd rather save them as CHAR(32) to especially improve keys. (I would even save 2/3 when switching from utf8 to some 1-byte-charset)

  • So what will the best character set be for such columns? ASCII? latin1? BINARY? Which collation?
  • What characterset/collation to use for other columns where I do not need utf8 support but need proper sorting. A Binary would fail?

Is it good practice to mix up different character sets in same mysql (innodb) table? Or do I get better performance when all columns have same charset within same table? Or even database?

like image 524
toshniba Avatar asked Dec 14 '22 20:12

toshniba


1 Answers

GUID/UUID/MD5/SHA1 are all hex and dash. For them

CHAR(..) CHARACTER SET ascii COLLATE ascii_general_ci

That will allow for A=a when comparing hex strings.

For Base64 things, use either of

CHAR(..) CHARACTER SET ascii COLLATE ascii_bin
BINARY(..)

since A is not semantically the same as a.

Further notes...

  • utf8 spits at you if you give it an invalid 8-bit value.
  • ascii spits at you for any 8-bit value.
  • latin1 accepts anything -- thereby your problems down the road
  • It is quite OK to have different columns in a table having different charsets and/or collations.
  • The charset/collation on the table is just a default, ripe for overriding at the column definition.
  • BINARY may be a tiny bit faster than any _bin collation, but not enough to notice.
  • Use CHAR for columns that are truly fixed length; don't mislead the user by using it for other cases.
  • %_bin is faster than %_general_ci, which is faster than other collations. Again, you would be hard-pressed to measure a difference.
  • Never use TINYTEXT or TINYBLOB.
  • For proper encoding, use the appropriate charset.
  • For "proper sorting", use the appropriate collation. See example below.
  • For "proper sorting" where multiple languages are represented, and you are using utf8mb4, use utf8mb4_unicode_520_ci (or utf8mb4_900_ci if using version 8.0). The 520 and 900 refer to Unicode standards; new collations are likely to come in the future.

If you are entirely in Czech, then consider these charsets and collations. I list them in preferred order:

mysql> show collation like '%czech%';
+------------------+---------+-----+---------+----------+---------+
| Collation        | Charset | Id  | Default | Compiled | Sortlen |
+------------------+---------+-----+---------+----------+---------+
| utf8mb4_czech_ci | utf8mb4 | 234 |         | Yes      |       8 | -- opens up the world
| utf8_czech_ci    | utf8    | 202 |         | Yes      |       8 | -- opens up most of the world
| latin2_czech_cs  | latin2  |   2 |         | Yes      |       4 | -- kinda like latin1

The rest are "useless":

| cp1250_czech_cs  | cp1250  |  34 |         | Yes      |       2 |
| ucs2_czech_ci    | ucs2    | 138 |         | Yes      |       8 |
| utf16_czech_ci   | utf16   | 111 |         | Yes      |       8 |
| utf32_czech_ci   | utf32   | 170 |         | Yes      |       8 |
+------------------+---------+-----+---------+----------+---------+
7 rows in set (0.00 sec)

More

  • The reason for using smaller datatypes (where appropriate) is to shrink the dataset, which leads to less I/O, which leads to things being more cacheable, which makes the program run faster. This is especially important for huge datasets; it is less important for small- or medium-sized datasets.
  • ENUM is 1 byte, yet acts like a string. So you get the "best of both worlds". (There are drawbacks, and there is a 'religious war' among advocates for ENUM vs TINYINT vs VARCHAR.)
  • Usually columns that are "short" are always the same length. A country_code is always 2 letters, always ascii, always could benefit from case insensitive collation. So CHAR(2) CHARACTER SET ascii COLLATE ascii_general_ci is optimal. If you have something that is sometimes 1-char, sometimes 2, then flip a coin; whatever you do won't make much difference.
  • VARCHAR (up to 255) has an extra 1-byte length attached to it. So, if your strings vary in length at all, VARCHAR is at least as good as CHAR. So simplify your brain processing: "variable length --> `VARCHAR".
  • BIT, depending on version, may be implemented as a 1-byte TINYINT UNSIGNED. If you have only a few bits in your table, it is not worth worrying about.
  • One of my Rules of Thumb says that if you aren't likely to get a 10% improvement, move on to some other optimization. Much of what we are discussing here is under 10% (space in this case). Still, get in the habit of thinking about it when writing CREATE TABLE. I often see tables with BIGINT and DOUBLE (each 8 bytes) that could easily use smaller columns. Sometimes saving more than 50% (space).
  • How does "space" translate into "speed". Tiny tables -> a tiny percentage. Huge tables -> In some cases 10x. (That's 10-fold, not 10%.) (UUIDs are one way to get really bad performance on huge tables.)

ENUM

  • Acts and feels like a string, yet takes only one byte. (One byte translates, indirectly, into a slight speed improvement.)
  • Practical when fewer than, say, 10 different values.
  • Impractical if frequently adding a new value -- requires ALTER TABLE, though it can be "inplace".
  • Suggest starting the list with 'unknown' (or something like that) and making the column NOT NULL (versus NULL).
  • The character set for the enum needs to be whatever is otherwise being used for the connection. The choice does not matter much unless you have options that collate equal (eg, A versus a).
like image 141
Rick James Avatar answered Jan 17 '23 02:01

Rick James