Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clarification on potential issues with equality of accented characters with non-binary collations

For a website with international support, I am using utf8mb4 charset and utf8mb4_unicode_ci collation in most tables and columns. Performance is not paramount and accurate sorting in a variety of languages is important.

I understand how the utf8mb4_general_ci and utf8mb4_unicode_ci collations work with comparisons in general with accented characters, namely that:

SELECT column FROM table WHERE column='abad';

Would return both 'abad' and 'abád'

While researching utf8 support in MySQL, I came across a supposed problem with the non-binary utf8___ collations. The page at http://mzsanford.com/blog/mysql-and-unicode/ describes a problem with changes not being saved in some updates. He says 'When updating a record it appears MySQL (or at least InnoDB) checks for equality before updating a record. Since an accent-only change is considered equal by the collation MySQL skips the write (which saves I/O overhead) and returns success since it thinks it optimized a write rather than failing.'

I interpret that as: if you tried to update a record making a change only in the accents of a field, it won't update properly (because MySQL thinks it already matches). But I was unable to replicate this. I created a simple test case:

CREATE DATABASE test_utf8 
    CHARACTER SET utf8mb4 
    COLLATE utf8mb4_unicode_ci;

USE test_utf8;

CREATE TABLE test (
    id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT, 
    text VARCHAR(300) NOT NULL, 
    PRIMARY KEY (id)
) ENGINE = INNODB;

INSERT INTO test (text) VALUES ('abád');

UPDATE test SET text='abad' WHERE id=1;

However, this updates the value correctly (despite only a change in accent on one character). Was this only an issue in an older version of MySQL perhaps? Or does this issue crop up in slightly different circumstances?


I would also appreciate if you have a moment to read through some of my notes of a few concepts around the subject and see if I have any misconceptions. If it is error-free, perhaps it will be useful information for someone.

MySQL's utf8 character set doesn't offer true utf8 support since the characters are only 1-3 bytes. For true utf8 support, you will probably want to use utf8mb4.

In general, utf8mb4_unicode_ci will be more accurate with language appropriate sorting, but there is a slight performance hit as opposed to using utf8mb4_general_ci.

If certain columns don't ever need to be sorted, and will use comparison/equality checks, you should use utf8mb4_bin as it will be slightly faster.

Accented characters are treated as equal in both utf8mb4_general_ci and utf8mb4_unicode_ci collations. Because of this, it is a bad collation choice for columns that must have unique values (primary keys for example). In that case, utf8mb4_bin should be used. And if a field needs to be accent-aware for uniqueness but also needs to be language sorted at some point, it can be stored as utf8mb4_bin and you can use a collate clause in the query when ordering. Ex:

SELECT column FROM table ORDER BY column COLLATE utf8mb4_unicode_ci;

This will cause the ordering to be language sorted despite its internal storage in a binary collation. This will impact performance, since the collation of the field determines how it is indexed. The performance difference of the query would be similar to the performance difference when sorting an un-indexed column vs an indexed column.

By default, a search under the utf8mb4_unicode_ci or utf8mb4_general_ci collations will not be accent aware, so a search for 'abad' would return 'abad' and 'abád'. So if you want accent aware searches, you will have to either set the column's collation to utf8mb4_binary (if all searches will be accent-sensitive) or use a collate clause in the query (if you want most searches to be accent-blind). Since the utf8mb4_bin collation is case-sensitive, you will also need to modify the query if you want case-insensitive but accent-aware searching. For example (assuming your search term has already been made lowercase in the server-side scripting language):

(Assuming the data is stored with a collation of utf8mb4_bin)
SELECT column FROM table WHERE LOWERCASE(column) LIKE 'abád';

(Assuming the data is stored with a collation of utf8mb4_unicode_ci)
SELECT column FROM table WHERE LOWERCASE(column) LIKE 'abád' COLLATE utf8mb4_bin;

Also, from the MySQL Documentation (just including it for others): When comparing values from different columns, declare those columns with the same character set and collation wherever possible, to avoid string conversions while running the query.

like image 587
dnag Avatar asked Feb 20 '14 21:02

dnag


1 Answers

I'm not an expert, but I tried what you did with some extras...

I ran your setup and the following on MySQL 5.6.17:

SELECT COUNT(*) FROM test WHERE `text`='abad';
SELECT COUNT(*) FROM test WHERE `text`='abád';
UPDATE test SET text='abád' WHERE id=1;

The selects both return 1 row, as we expect, and the update (like your update) modifies 1 row, counter to what the blog suggests.

I thought it might be a lower-level optimization, but I noticed something interesting when I tried running this again in the command line client (instead of Workbench):

mysql> SELECT COUNT(*) FROM test WHERE `text`='abád';
ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and
(utf8_general_ci,COERCIBLE) for operation '='
mysql> UPDATE test SET text='abád' WHERE id=1;
ERROR 1366 (HY000): Incorrect string value: '\xA0d' for column 'text' at row 1

So I ran this to see what was going on:

mysql> SELECT collation('abád');
+-------------------+
| collation('abád')  |
+-------------------+
| utf8_general_ci   |
+-------------------+
1 row in set (0.00 sec)

There must be some coercion going on due to the set of my session...so I tried explicitly matching:

UPDATE test SET text='abad' COLLATE utf8_unicode_ci WHERE id=1;
UPDATE test SET text='abád' COLLATE utf8_unicode_ci WHERE id=1;

And still I got the same results (updated both times).

For now, I'm left with my guess, that InnoDB's optimization is done at a lower level than SELECTing against text criteria.

like image 170
Christopher McGowan Avatar answered Sep 28 '22 10:09

Christopher McGowan