
How much UTF-8 text fits in a MySQL "Text" field?

Tags: mysql, utf-8

According to MySQL, a text column holds 65,535 bytes.

So if this is a legitimate boundary, then it will actually only fit about 32k UTF-8 characters, right? Or is this one of those "fuzzy" boundaries where the guys that wrote the docs can't tell characters from bytes, and it will actually allow ~64k UTF-8 characters if set to something like utf8_general_ci?

Xeoncross asked Dec 12 '10 02:12



3 Answers

A text column can be up to 65,535 bytes.

A character in MySQL's utf8 character set can take up to 3 bytes.

So, in the worst case, your actual limit can be as low as 21,844 characters.

See the manual for more info: http://dev.mysql.com/doc/refman/5.1/en/string-type-overview.html

Its VARCHAR entry reads:

A variable-length string. M represents the maximum column length in characters. The range of M is 0 to 65,535. The effective maximum length of a VARCHAR is subject to the maximum row size (65,535 bytes, which is shared among all columns) and the character set used. For example, utf8 characters can require up to three bytes per character, so a VARCHAR column that uses the utf8 character set can be declared to be a maximum of 21,844 characters.
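The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function and constant names are made up for the example, and the 2-byte VARCHAR length prefix is my assumption about why the manual quotes 21,844 rather than the 21,845 that plain division of 65,535 by 3 gives.

```python
TEXT_MAX_BYTES = 65_535        # 2**16 - 1, the byte limit quoted above
VARCHAR_LENGTH_PREFIX = 2      # assumption: prefix bytes counted inside the row limit


def worst_case_chars(max_bytes: int, bytes_per_char: int) -> int:
    """Characters guaranteed to fit if every character is maximally wide."""
    return max_bytes // bytes_per_char


# Worst case for each possible character width in bytes.
for width in (1, 2, 3, 4):
    print(width, worst_case_chars(TEXT_MAX_BYTES, width))

# VARCHAR figure from the manual quote, under the prefix assumption above.
print(worst_case_chars(TEXT_MAX_BYTES - VARCHAR_LENGTH_PREFIX, 3))
```

Either way, the guaranteed capacity with 3-byte characters is roughly 21.8k characters, not 32k or 64k.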

Wolph answered Oct 03 '22 21:10


UTF-8 characters can take up to 4 bytes each, not 2 as you are supposing. UTF-8 is a variable-width encoding, depending on the number of significant bits in the Unicode code point:

  • 7 bits and under in the Unicode code point: 1 byte in UTF-8
  • 8 to 11 bits: 2 bytes in UTF-8
  • 12 to 16 bits: 3 bytes
  • 17 to 21 bits: 4 bytes
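The table above can be checked against a real encoder. Here is a minimal Python sketch; `utf8_len` is a name invented for this example, and it only mirrors the bit-count rule from the list, so it does not reject surrogates or values above U+10FFFF that still fit in 21 bits.

```python
def utf8_len(code_point: int) -> int:
    """Bytes needed to encode a code point, using the bit-count rule above."""
    bits = code_point.bit_length()
    if bits <= 7:
        return 1
    if bits <= 11:
        return 2
    if bits <= 16:
        return 3
    if bits <= 21:
        return 4
    raise ValueError("code point needs more than 21 bits")


# Compare the rule with Python's own UTF-8 encoder for one character per width.
for ch in ("A", "é", "€", "😀"):
    print(f"U+{ord(ch):04X}", utf8_len(ord(ch)), len(ch.encode("utf-8")))
```

The last sample, U+1F600, lies outside the Basic Multilingual Plane, which is why it needs the fourth byte.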

The original UTF-8 spec allows encoding up to 31-bit Unicode values, taking as many as 6 bytes to encode in UTF-8 form. After UTF-8 became popular, the Unicode Consortium declared that they will never use code points beyond 2^21 - 1. This is now standardized as RFC 3629.

MySQL's utf8 character set currently (i.e. as of version 5.6) only supports the Unicode Basic Multilingual Plane characters, for which UTF-8 needs up to 3 bytes per character; full 4-byte support requires the separate utf8mb4 character set. That means the current answer to your question is that a utf8 TEXT field can hold at least 21,844 characters.

Depending on how you look at it, the actual limits are higher or lower than that:

  • If you assume, as I do, that the BMP limitation will eventually be lifted in MySQL or one of its forks, you shouldn't count on being able to store more than 16,383 characters in that field if your MySQL client allows arbitrary Unicode text input.

  • On the other hand, you may be able to exploit the fact that UTF-8 is a variable width encoding. If you know your text is mostly plain English with just the occasional non-ASCII character, your effective in-practice limit could approach the maximum 64 KB - 1 character limit.
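Both ends of that range are easy to see by measuring encoded length rather than character count. The sketch below is a rough illustration in Python, not MySQL behaviour: `fits_in_text` is a hypothetical helper, and it uses full 4-byte UTF-8, which corresponds to the "BMP limitation lifted" scenario rather than to MySQL's 3-byte utf8.

```python
TEXT_MAX_BYTES = 65_535  # byte limit of a TEXT column, as quoted above


def fits_in_text(s: str, max_bytes: int = TEXT_MAX_BYTES) -> bool:
    """True if the UTF-8 encoding of s is within the byte limit."""
    return len(s.encode("utf-8")) <= max_bytes


# Mostly ASCII with a few 2-byte characters: 65,200 characters, 65,400 bytes.
ascii_heavy = "a" * 65_000 + "é" * 200
# Only 4-byte characters: 16,384 characters already need 65,536 bytes.
emoji_only = "😀" * 16_384

print(len(ascii_heavy), fits_in_text(ascii_heavy))
print(len(emoji_only), fits_in_text(emoji_only))
```

An application that validates input length would therefore want to check bytes, as here, instead of counting characters.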

Warren Young answered Oct 03 '22 20:10


However, when the column is used as a primary key, MySQL assumes that each character of the column's declared length adds 3 bytes to the key.

mysql> alter table test2 modify code varchar(333) character set utf8;
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table test2 modify code varchar(334) character set utf8;
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes

Well, using long string columns as a primary key is generally a bad practice; however, I came across this problem when working with the database of a commercial (!) product.
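The cutoff between 333 and 334 in the session above follows directly from the 3-bytes-per-character assumption. A small Python sketch of that arithmetic, where the 1000-byte limit is taken from the error message and the helper names are invented for illustration:

```python
MAX_KEY_BYTES = 1000       # limit reported in the error message above
UTF8_BYTES_PER_CHAR = 3    # worst case MySQL assumes for its utf8 character set


def key_bytes(declared_chars: int, bytes_per_char: int = UTF8_BYTES_PER_CHAR) -> int:
    """Key size MySQL reserves for a column declared with this many characters."""
    return declared_chars * bytes_per_char


def max_indexable_chars(max_key_bytes: int = MAX_KEY_BYTES,
                        bytes_per_char: int = UTF8_BYTES_PER_CHAR) -> int:
    """Largest declared length whose worst-case key size stays within the limit."""
    return max_key_bytes // bytes_per_char


print(key_bytes(333), key_bytes(334))   # 999 fits, 1002 does not
print(max_indexable_chars())
```

The limit applies to the declared length, so it does not matter how short the stored values actually are.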

Danubian Sailor answered Oct 03 '22 21:10