Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Mysql four byte chinese characters support

Tags:

mysql

utf-8

I can't execute this SQL script:

INSERT INTO `mabase`.`new_table` (`idnew_table`, `name`) VALUES ('2', '𠼭');

the error is:

ERROR 1366: Incorrect string value: '\xF0\xA0\xBC\xAD' for column 'name' at row 1 SQL Statement: INSERT INTO mabase.new_table (idnew_table, name) VALUES ('2', '𠼭')

My database and table is in utf8 charset and utf8_general_ci collation. Also i tried: utf8_unicode_ci, utf8mb4_general_ci, bg5_cinese_ci, gbk_cinese_ci.

I've tried all that in MySql workbench on windows.

𠼭 is four byte character. I have problems only with them. Tell me Please how can I save four byte character in mysql.

like image 380
yaroslav prokipchyn Avatar asked Jul 16 '13 15:07

yaroslav prokipchyn


1 Answers

Your desired character, U+20F2D, resides in the "CJK Unified Ideographs Extension B" block of Unicode's "Supplementary Ideographic Plane" and therefore was not available in any MySQL Unicode character set prior to v5.5; since v5.5, it is available in the utf8mb4, utf16, utf16le and utf32 character sets.

It is not available in MySQL's big5 or gbk character sets.


Why the utf8 encoding does not work

As documented under Unicode Support:

The initial implementation of Unicode support (in MySQL 4.1) included two character sets for storing Unicode data:

  • ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character.

  • utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character.

These two character sets support the characters from the Basic Multilingual Plane (BMP) of Unicode Version 3.0. BMP characters have these characteristics:

  • Their code values are between 0 and 65535 (or U+0000 .. U+FFFF).

  • They can be encoded with a fixed 16-bit word, as in ucs2.

  • They can be encoded with 8, 16, or 24 bits, as in utf8.

  • They are sufficient for almost all characters in major languages.

Characters not supported by the aforementioned character sets include supplementary characters that lie outside the BMP. Characters outside the BMP compare as REPLACEMENT CHARACTER and convert to '?' when converted to a Unicode character set.

In MySQL 5.6, Unicode support includes supplementary characters, which requires new character sets that have a broader range and therefore take more space. The following table shows a brief feature comparison of previous and current Unicode support.

╔══════════════════════════════╦══════════════════════════════════════════════╗
║       Before MySQL 5.5MySQL 5.5 and up                ║
╠══════════════════════════════╬══════════════════════════════════════════════╣
║ All Unicode 3.0 characters   ║ All Unicode 5.0 and 6.0 characters           ║
╠══════════════════════════════╬══════════════════════════════════════════════╣
║ No supplementary characters  ║ With supplementary characters                ║
╠══════════════════════════════╬══════════════════════════════════════════════╣
║ ucs2 character set, BMP only ║ No change                                    ║
╠══════════════════════════════╬══════════════════════════════════════════════╣
║ utf8 character set for up to ║ No change                                    ║
║ three bytes, BMP only        ║                                              ║
╠══════════════════════════════╬══════════════════════════════════════════════╣
║                              ║ New utf8mb4 character set for up to four     ║
║                              ║ bytes, BMP or supplemental                   ║
╠══════════════════════════════╬══════════════════════════════════════════════╣
║                              ║ New utf16 character set, BMP or supplemental ║
╠══════════════════════════════╬══════════════════════════════════════════════╣
║                              ║ New utf16le character set, BMP or            ║
║                              ║ supplemental (5.6.1 and up)                  ║
╠══════════════════════════════╬══════════════════════════════════════════════╣
║                              ║ New utf32 character set, BMP or supplemental ║
╚══════════════════════════════╩══════════════════════════════════════════════╝

These changes are upward compatible. If you want to use the new character sets, there are potential incompatibility issues for your applications; see Section 10.1.11, “Upgrading from Previous to Current Unicode Support”. That section also describes how to convert tables from utf8 to the (4-byte) utf8mb4 character set, and what constraints may apply in doing so.

Why the big5 encoding does not work

As documented under What problems should I be aware of when working with the Big5 Chinese character set?:

MySQL supports the Big5 character set which is common in Hong Kong and Taiwan (Republic of China). MySQL's big5 is in reality Microsoft code page 950, which is very similar to the original big5 character set.

[ deletia ]

A feature request for adding HKSCS extensions has been filed. People who need this extension may find the suggested patch for Bug #13577 to be of interest.

Why the gbk encoding does not work

As documented under What CJK character sets are available in MySQL?:

Here, we try to clarify exactly what characters are legitimate in gb2312 or gbk, with reference to the official documents. Please check these references before reporting gb2312 or gbk bugs.

  • For a complete listing of the gb2312 characters, ordered according to the gb2312_chinese_ci collation: gb2312

  • MySQL's gbk is in reality “Microsoft code page 936”. This differs from the official gbk for characters A1A4 (middle dot), A1AA (em dash), A6E0-A6F5, and A8BB-A8C0.

  • For a listing of gbk/Unicode mappings, see http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT.

  • For MySQL's listing of gbk characters, see gbk.

like image 113
eggyal Avatar answered Oct 16 '22 20:10

eggyal