
How can I search by emoji in MySQL using utf8mb4?

Please help me understand how multibyte characters such as emoji are handled in MySQL utf8mb4 fields.

See below for a simple test SQL to illustrate the challenges.

/* Clear Previous Test */
DROP TABLE IF EXISTS `emoji_test`;
DROP TABLE IF EXISTS `emoji_test_with_unique_key`;

/* Build Schema */
CREATE TABLE `emoji_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `emoji_test_with_unique_key` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_string_status` (`string`,`status`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

/* INSERT data */
# Expected Result is successful insert for each of these.
# However some fail. See comments.
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌶', 1);                   # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌮', 1);                   # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌮🌶', 1);                 # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌶🌮', 1);                 # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶', 1);   # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮', 1);   # FAIL: Duplicate entry '?-1' for key 'idx_string_status'
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮🌶', 1); # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶🌮', 1); # FAIL: Duplicate entry '??-1' for key 'idx_string_status'

/* Test data */

    /* Simple Table */
SELECT * FROM emoji_test WHERE `string` IN ('🌶','🌮','🌮🌶','🌶🌮'); # SUCCESS (all 4 are found)
SELECT * FROM emoji_test WHERE `string` IN ('🌶');                     # FAIL: Returns both 🌶 and 🌮
SELECT * FROM emoji_test WHERE `string` IN ('🌮');                     # FAIL: Returns both 🌶 and 🌮
SELECT * FROM emoji_test;                                              # SUCCESS (all 4 are found)

    /* Table with Unique Key */
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌶','🌮','🌮🌶','🌶🌮'); # FAIL: Only 2 are found (due to insert errors above)
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌶');                     # SUCCESS
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌮');                     # FAIL: 🌶 found instead of 🌮
SELECT * FROM emoji_test_with_unique_key;                                              # FAIL: Only 2 records found (🌶 and 🌮🌶)

I'm interested in learning what causes the FAILs above and how I can get around this.

Specifically:

  1. Why do selects for one multibyte character return results for any multibyte character?
  2. How can I configure an index to handle multibyte characters instead of ??
  3. Can you recommend changes to the second CREATE TABLE (the one with a unique key) above in such a way that makes all the test queries return successfully?
asked Dec 14 '16 16:12 by Ryan



2 Answers

You use utf8mb4_unicode_ci for your columns, so comparisons are case insensitive and follow that collation's weights rather than the raw bytes. If you use utf8mb4_bin instead, the emoji 🌮 and 🌶 are correctly identified as different characters.

With WEIGHT_STRING you can get the values that are used for sorting and comparison for the input string.

If you write:

SELECT
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_unicode_ci'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_unicode_ci');

Then you can see that both are 0xfffd. The Unicode Character Sets section of the MySQL manual says:

For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER.

If you write:

SELECT
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_bin'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_bin');

You will get their Unicode code points, 0x01f32e and 0x01f336, instead.
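As a quick cross-check outside MySQL, here is a minimal Python sketch (not part of the original answer; the helper name `describe` is made up for illustration). It shows that both emoji sit above U+FFFF, which is what makes them supplementary characters, and that each takes 4 bytes in UTF-8, which is why the column needs utf8mb4 in the first place.

```python
# Code points of the two emoji from the question.
TACO = "\U0001F32E"    # 🌮
PEPPER = "\U0001F336"  # 🌶


def describe(ch: str) -> dict:
    """Return the code point, UTF-8 length and supplementary status of one character."""
    cp = ord(ch)
    return {
        "code_point": hex(cp),
        "utf8_bytes": len(ch.encode("utf-8")),
        # Anything above U+FFFF lies outside the Basic Multilingual Plane,
        # i.e. it is a supplementary character.
        "supplementary": cp > 0xFFFF,
    }


if __name__ == "__main__":
    for ch in (TACO, PEPPER):
        print(ch, describe(ch))
    # 🌮 {'code_point': '0x1f32e', 'utf8_bytes': 4, 'supplementary': True}
    # 🌶 {'code_point': '0x1f336', 'utf8_bytes': 4, 'supplementary': True}
```

The code points match the values that WEIGHT_STRING reports under utf8mb4_bin.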

Other letters such as Ä, Á and A also compare as equal under utf8mb4_unicode_ci. You can see why with:

SELECT
  WEIGHT_STRING('Ä' COLLATE 'utf8mb4_unicode_ci'),
  WEIGHT_STRING('A' COLLATE 'utf8mb4_unicode_ci');

Both map to the primary weight 0x0E33:

Ä: 00C4  ; [.0E33.0020.0008.0041][.0000.0047.0002.0308] # LATIN CAPITAL LETTER A WITH DIAERESIS; QQCM
A: 0041  ; [.0E33.0020.0008.0041] # LATIN CAPITAL LETTER A

According to "Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations in MariaDB/MySQL?", the weights used for utf8mb4_unicode_ci are based on UCA 4.0.0. Because the emoji do not appear there, they are mapped to the weight 0xfffd.

If you need case-insensitive comparison and sorting for regular letters along with emoji, you can solve this problem by using utf8mb4_unicode_520_ci:

SELECT
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_unicode_520_ci'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_unicode_520_ci');

With this collation the two emoji get different weights, 0xfbc3f32e and 0xfbc3f336.
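The values 0xfbc3f32e and 0xfbc3f336 are consistent with the implicit-weight formula from the Unicode Collation Algorithm for code points that have no explicit entry in the weight table: aaaa = base + (code >> 15) and bbbb = (code & 0x7FFF) | 0x8000, with base 0xFBC0 for code points outside the CJK ideograph ranges. The Python sketch below is my own illustration and assumes MySQL applies that formula here; `implicit_weight` is a hypothetical helper name, not a MySQL function.

```python
def implicit_weight(code_point: int, base: int = 0xFBC0) -> int:
    """Compute a UCA-style implicit weight as one 32-bit value (aaaa followed by bbbb).

    Assumption: base 0xFBC0, the value UCA uses for code points that are
    not CJK unified ideographs.
    """
    aaaa = base + (code_point >> 15)
    bbbb = (code_point & 0x7FFF) | 0x8000
    return (aaaa << 16) | bbbb


if __name__ == "__main__":
    for ch in ("\U0001F32E", "\U0001F336"):  # 🌮 and 🌶
        print(ch, hex(implicit_weight(ord(ch))))
    # 🌮 0xfbc3f32e
    # 🌶 0xfbc3f336
```

Because the low 15 bits of the code point survive in bbbb, two different emoji end up with different weights, which is why the equality test stops treating them as the same character.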

answered Oct 04 '22 02:10 by t.niese


You don't need to look at the weights. To see whether two characters (or strings) compare as equal, do something like this:

mysql> SELECT '🌮' = '🌶' COLLATE utf8mb4_unicode_ci;
+--------------------------------------+
| '?' = '?' COLLATE utf8mb4_unicode_ci |
+--------------------------------------+
|                                    1 |  1 = true, hence equal
+--------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT '🌮' = '🌶' COLLATE utf8mb4_unicode_520_ci;
+------------------------------------------+
| '?' = '?' COLLATE utf8mb4_unicode_520_ci |
+------------------------------------------+
|                                        0 |  unequal
+------------------------------------------+
1 row in set (0.00 sec)
answered Oct 04 '22 00:10 by Rick James