According to MySQL, a text column holds 65,535 bytes.
So if this is a legitimate boundary, then it will actually only fit about 32k UTF-8 characters, right? Or is this one of those "fuzzy" boundaries where the guys that wrote the docs can't tell characters from bytes, and it will actually allow ~64k UTF-8 characters if set to something like utf8_general_ci?
A TEXT column with a maximum length of 16,777,215 (2^24 − 1) characters. The effective maximum length is less if the value contains multibyte characters. Each MEDIUMTEXT value is stored using a 3-byte length prefix that indicates the number of bytes in the value.
TEXT: 65,535 bytes (64 KB). The standard TEXT data object is sufficiently capable of handling typical long-form text content. TEXT data objects top out at 64 KB (2^16 − 1, or 65,535 bytes) and require a 2-byte overhead.
UTF-8 Basics. UTF-8 (Unicode Transformation Format, 8-bit) is an encoding defined by the International Organization for Standardization (ISO) in ISO 10646. It can represent up to 2,097,152 code points (2^21), more than enough to cover the current 1,112,064 Unicode code points.
UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.
A text column can be up to 65,535 bytes.
A UTF-8 character in MySQL's utf8 character set can be up to 3 bytes.
So in the worst case your actual limit is about 21,844 characters.
See the manual for more info: http://dev.mysql.com/doc/refman/5.1/en/string-type-overview.html
A variable-length string. M represents the maximum column length in characters. The range of M is 0 to 65,535. The effective maximum length of a VARCHAR is subject to the maximum row size (65,535 bytes, which is shared among all columns) and the character set used. For example, utf8 characters can require up to three bytes per character, so a VARCHAR column that uses the utf8 character set can be declared to be a maximum of 21,844 characters.
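The arithmetic behind these figures can be sketched in Python. This is only a back-of-the-envelope check, not anything MySQL itself runs: the helper name max_chars is made up for illustration, and the sketch assumes a 2-byte VARCHAR length prefix that counts against the row size, while a TEXT value's length prefix is stored on top of its 65,535 data bytes.

```python
def max_chars(byte_budget, bytes_per_char, overhead=0):
    """Worst-case number of characters that fit in byte_budget bytes."""
    return (byte_budget - overhead) // bytes_per_char

# TEXT: up to 65,535 bytes of data (assumption: the 2-byte length
# prefix is stored in addition to the data, not taken out of it).
text_utf8 = max_chars(65535, 3)

# VARCHAR: the 65,535-byte row size also has to hold the 2-byte length
# prefix (assumption: a table with only this one column).
varchar_utf8 = max_chars(65535, 3, overhead=2)

print(text_utf8, varchar_utf8)
```

Under these assumptions a TEXT column comes out one character above the VARCHAR figure quoted from the manual, so 21,844 is a safe lower bound either way.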
UTF-8 characters can take up to 4 bytes each, not 2 as you are supposing. UTF-8 is a variable-width encoding, depending on the number of significant bits in the Unicode code point:
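The byte counts are easy to check with Python's built-in UTF-8 encoder. The helper name utf8_len and the sample code points are chosen for illustration only; they sit at the edges of the 1-, 2-, 3- and 4-byte ranges.

```python
def utf8_len(ch):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# Code points at the boundaries of each encoded length.
samples = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in samples:
    print(f"U+{cp:04X} -> {utf8_len(chr(cp))} byte(s)")
```

Code points up to U+007F take 1 byte, up to U+07FF take 2, up to U+FFFF (the Basic Multilingual Plane) take 3, and everything above that takes 4.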
The original UTF-8 spec allows encoding up to 31-bit Unicode values, taking as many as 6 bytes to encode in UTF-8 form. After UTF-8 became popular, the Unicode Consortium declared that they will never use code points beyond 2^21 - 1. This is now standardized as RFC 3629.
MySQL's utf8 character set currently (i.e. as of version 5.6) only supports the Unicode Basic Multilingual Plane characters, for which UTF-8 needs up to 3 bytes per character. That means the current answer to your question is that your TEXT field can hold at least 21,844 characters.
Depending on how you look at it, the actual limits are higher or lower than that:
If you assume, as I do, that the BMP limitation will eventually be lifted in MySQL or one of its forks, you shouldn't count on being able to store more than 16,383 characters in that field if your MySQL client allows arbitrary Unicode text input.
On the other hand, you may be able to exploit the fact that UTF-8 is a variable width encoding. If you know your text is mostly plain English with just the occasional non-ASCII character, your effective in-practice limit could approach the maximum 64 KB - 1 character limit.
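Since the limit is in bytes, the only reliable check is to measure the encoded length of the actual text. A minimal sketch in Python, assuming the 65,535-byte TEXT limit discussed above; the function name fits_in_text and the sample strings are hypothetical, and the 4-byte case only applies if the column's character set can store such characters at all.

```python
TEXT_MAX_BYTES = 65535  # TEXT limit in bytes, as discussed above

def fits_in_text(value, limit=TEXT_MAX_BYTES):
    """True if value, encoded as UTF-8, needs no more than limit bytes."""
    return len(value.encode("utf-8")) <= limit

# Mostly ASCII: 65,200 characters, but only 65,400 bytes.
mostly_ascii = "a" * 65000 + "\u00e9" * 200

# Worst case: 4-byte characters fill the column after 16,383 of them.
four_byte_ok = "\U0001F600" * 16383
four_byte_too_long = "\U0001F600" * 16384

print(fits_in_text(mostly_ascii))
print(fits_in_text(four_byte_ok))
print(fits_in_text(four_byte_too_long))
```

The same character count can therefore fit or not fit depending on which characters the text actually contains.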
However, when the column is used as a primary key, MySQL assumes that every character of the declared column length adds 3 bytes to the key.
mysql> alter table test2 modify code varchar(333) character set utf8;
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> alter table test2 modify code varchar(334) character set utf8;
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes
Well, using long string columns as a primary key is generally bad practice; however, I came across this problem when working with the database of a commercial (!) product.
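The cutoff between varchar(333) and varchar(334) in the session above follows from simple multiplication. A rough sketch in Python, assuming the 1000-byte key limit from the error message and 3 bytes reserved per utf8 character; the names below are made up for illustration.

```python
MAX_KEY_BYTES = 1000  # taken from the error message above
BYTES_PER_CHAR = 3    # bytes reserved per utf8 character

def key_bytes(declared_chars, bytes_per_char=BYTES_PER_CHAR):
    """Bytes reserved in the key for a column of declared_chars characters."""
    return declared_chars * bytes_per_char

for n in (333, 334):
    needed = key_bytes(n)
    verdict = "ok" if needed <= MAX_KEY_BYTES else "too long"
    print(f"varchar({n}): {needed} bytes -> {verdict}")

# Longest declared length that still fits in the key.
print(MAX_KEY_BYTES // BYTES_PER_CHAR)
```

The reservation is based on the declared length, so the error appears even if the stored values are plain ASCII.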