How can I search by emoji in MySQL using utf8mb4?

Please help me understand how multibyte characters like emoji are handled in MySQL utf8mb4 fields.

See below for a simple test SQL to illustrate the challenges.

/* Clear Previous Test */
DROP TABLE IF EXISTS `emoji_test`;
DROP TABLE IF EXISTS `emoji_test_with_unique_key`;

/* Build Schema */
CREATE TABLE `emoji_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `emoji_test_with_unique_key` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_string_status` (`string`,`status`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

/* INSERT data */
# Expected Result is successful insert for each of these.
# However some fail. See comments.
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌶', 1);                   # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌮', 1);                   # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌮🌶', 1);                 # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('🌶🌮', 1);                 # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶', 1);   # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮', 1);   # FAIL: Duplicate entry '?-1' for key 'idx_string_status'
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮🌶', 1); # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶🌮', 1); # FAIL: Duplicate entry '??-1' for key 'idx_string_status'

/* Test data */

    /* Simple Table */
SELECT * FROM emoji_test WHERE `string` IN ('🌶','🌮','🌮🌶','🌶🌮'); # SUCCESS (all 4 are found)
SELECT * FROM emoji_test WHERE `string` IN ('🌶');                     # FAIL: Returns both 🌶 and 🌮
SELECT * FROM emoji_test WHERE `string` IN ('🌮');                     # FAIL: Returns both 🌶 and 🌮
SELECT * FROM emoji_test;                                              # SUCCESS (all 4 are found)

    /* Table with Unique Key */
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌶','🌮','🌮🌶','🌶🌮'); # FAIL: Only 2 are found (due to insert errors above)
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌶');                     # SUCCESS
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌮');                     # FAIL: 🌶 found instead of 🌮
SELECT * FROM emoji_test_with_unique_key;                                              # FAIL: Only 2 records found (🌶 and 🌮🌶)

I'm interested in learning what causes the FAILs above and how I can get around this.

Specifically:

Why do selects for one multibyte character return results for any multibyte character?
How can I configure an index to handle multibyte characters instead of treating them as ??
Can you recommend changes to the second CREATE TABLE (the one with a unique key) above so that all the test queries return successfully?

654

asked Dec 14 '16 16:12

Ryan

2 Answers

You use utf8mb4_unicode_ci for your columns, so values are compared by that collation's weights rather than by their code points, and under it 🌮 and 🌶 compare as equal. If you use utf8mb4_bin instead, the two emoji are correctly identified as different characters.

With WEIGHT_STRING you can get the values that are used for sorting and comparison for the input string.

If you write:

SELECT
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_unicode_ci'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_unicode_ci');

Then you can see that both are 0xfffd. The MySQL documentation on Unicode character sets says:

For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER.

If you write:

SELECT
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_bin'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_bin');

You will get their Unicode code points, 0x01f32e and 0x01f336, instead.

For letters like Ä, Á and A, which compare as equal under utf8mb4_unicode_ci, the weights can be inspected the same way:

SELECT
  WEIGHT_STRING('Ä' COLLATE 'utf8mb4_unicode_ci'),
  WEIGHT_STRING('A' COLLATE 'utf8mb4_unicode_ci');

Both map to the weight 0x0E33:

Ä: 00C4  ; [.0E33.0020.0008.0041][.0000.0047.0002.0308] # LATIN CAPITAL LETTER A WITH DIAERESIS; QQCM
A: 0041  ; [.0E33.0020.0008.0041] # LATIN CAPITAL LETTER A

According to "Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations in MariaDB/MySQL?", the weights used for utf8mb4_unicode_ci are based on UCA 4.0.0. Because the emoji do not appear there, they are mapped to the weight 0xfffd.

If you need case-insensitive comparison and sorting for regular letters along with emoji, use utf8mb4_unicode_520_ci:

SELECT
  WEIGHT_STRING('🌮' COLLATE 'utf8mb4_unicode_520_ci'),
  WEIGHT_STRING('🌶' COLLATE 'utf8mb4_unicode_520_ci');

With that collation the two emoji get different weights, 0xfbc3f32e and 0xfbc3f336.
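
Applied to the question's second table, this could look like the sketch below. It is the question's CREATE TABLE with only the column collation changed. It assumes a server that ships utf8mb4_unicode_520_ci and a client connection that itself uses utf8mb4 (the SET NAMES line); the question states neither, so treat both as assumptions.

```sql
/* Assumption: the connection must use utf8mb4 so the emoji literals arrive intact. */
SET NAMES utf8mb4;

DROP TABLE IF EXISTS `emoji_test_with_unique_key`;

/* Same schema as in the question; only the collation of `string` differs. */
CREATE TABLE `emoji_test_with_unique_key` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_string_status` (`string`,`status`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

/* The four values now have distinct weights, so none should collide in the unique key. */
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶', 1);
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮', 1);
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌮🌶', 1);
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('🌶🌮', 1);

/* Each lookup should match only the row holding exactly that string. */
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌶');
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('🌮');
```

Swapping in utf8mb4_bin for the column would also keep the emoji apart, at the cost of making comparisons on regular letters case sensitive.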

103

answered Oct 04 '22 02:10

t.niese

You don't need to look at weights. Do something like this to see whether two characters (or strings) are equal.

mysql> SELECT '🌮' = '🌶' COLLATE utf8mb4_unicode_ci;
+--------------------------------------+
| '?' = '?' COLLATE utf8mb4_unicode_ci |
+--------------------------------------+
|                                    1 |  1 = true, hence equal
+--------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT '🌮' = '🌶' COLLATE utf8mb4_unicode_520_ci;
+------------------------------------------+
| '?' = '?' COLLATE utf8mb4_unicode_520_ci |
+------------------------------------------+
|                                        0 |  unequal
+------------------------------------------+
1 row in set (0.00 sec)
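
The same COLLATE clause can be attached to a query against the existing table, which overrides the column's collation for that one comparison without altering the schema. A minimal sketch against the question's emoji_test table, assuming the connection character set is utf8mb4 (otherwise the server rejects a utf8mb4 collation on the literal):

```sql
/* Assumption: the connection uses utf8mb4. */
SET NAMES utf8mb4;

/* Explicit COLLATE on the literal takes precedence over the column's utf8mb4_unicode_ci. */
SELECT * FROM emoji_test WHERE `string` = '🌶' COLLATE utf8mb4_unicode_520_ci;

/* Binary comparison also separates the emoji, but is case sensitive for letters. */
SELECT * FROM emoji_test WHERE `string` = '🌶' COLLATE utf8mb4_bin;
```

Both queries should return only the 🌶 row. An index built on the column under utf8mb4_unicode_ci is unlikely to be usable for such a comparison, so changing the column collation is probably the better long-term fix.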

answered Oct 04 '22 00:10

Rick James

Tags: sql, mysql, emoji, utf8mb4