Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL: Querying for unicode entities

Tags:

mysql

unicode

I need to find the word Lämmönmyyntipalvelut from the database. Only, in the database it is in a field, whose value has been a PHP array, converted into JSON using json_encode() and so the special characters are scrabled into hex unicode.

So my query is

SELECT * FROM table WHERE (services LIKE '%Lämmönmyyntipalvelut%')

No results. No surprise. Next, query with special characters converted:

SELECT * FROM table WHERE (services LIKE '%L\u00e4mm\u00f6nmyyntipalvelut%')

No results and I wonder why. Next I tested querying for only special character:

SELECT * FROM table WHERE (services LIKE '%\u00e4%')

Found what was supposed to find. Next I started adding stuff (L to beginning) to see where it went wrong:

SELECT * FROM table WHERE (services LIKE '%L\u00e4%')

No results. Another test:

SELECT * FROM table WHERE (services LIKE '%\u00e4mm%')

Found what was supposed to find.

So my conclusion is that the backslash is somehow messing things up, but I don't understand how?

EDIT:

Exact contents of services field:

["Neuvonta","L\u00e4mm\u00f6nmyyntipalvelut",
"Mets\u00e4-\/energiapuunkorjuupalvelut"]

Exact query:

SELECT id, uid, company_name, services, logo FROM rekisteroeidy_toimijaks 
WHERE 
    (services LIKE '%L\u00e4mm\u00f6nmyyntipalvelut%' AND 
    services LIKE '%Mets\u00e4-\/energiapuunkorjuupalvelut%') 
ORDER BY company_name ASC

I added some line breaks to help readability.

like image 634
Simon Josef Kok Avatar asked Nov 08 '12 10:11

Simon Josef Kok


People also ask

Does MySQL support Unicode?

MySQL supports multiple Unicode character sets: utf8mb4 : A UTF-8 encoding of the Unicode character set using one to four bytes per character. utf8mb3 : A UTF-8 encoding of the Unicode character set using one to three bytes per character. utf8 : An alias for utf8mb3 .

How do I find the character set of a MySQL database?

To see the default character set and collation for a given database, use these statements: USE db_name; SELECT @@character_set_database, @@collation_database; Alternatively, to display the values without changing the default database: SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.

Does MySQL allow special characters?

Alternatively, MySQL also has special character escape sequences as shown below: \0 - An ASCII NUL (0x00) character. \' - A single quote ( ' ) character. \" - A double quote ( " ) character.


1 Answers

I have absolutely no idea why, but triple escaping helps!

Well, that's only double-escaping, but yes it works and here's why: in MySQL, there is a second layer of escaping involved when you use the LIKE operator.

services LIKE '%L\\\\u00e4mm\\\\u00f6n%'

parsing that MySQL string literal gives you a comparison with the LIKE-query %L\\u00e4mm\\u00f6n%. Because MySQL treats \ in a LIKE query as an escape, that will actually match the literal string containing L\u00e4mm\u00f6n.

The reason for this is so that you can match strings against a query expression that contains a literal % or _ character. For example if I want to search a column for the literal string 100%, I can match it against 100\% (written in a query as '100\\%') and make sure I'm really getting one hundred percent and just not any string starting with a hundred.

It's unfortunate that MySQL uses backslash for both its LIKE query escaping and its string literal escaping, especially given that you're probably writing in an enclosing programming language that also uses them, ending up with actual triple-encoding, which looks like "services LIKE '%L\\\\\\\\u00e4mm\\\\\\\\u00f6n%'" - argh!

It's doubly unfortunate given that this behaviour is not ANSI SQL conformant, and won't work in any other database. ANSI SQL says that there is no escape character in LIKE queries by default, so if you want to match a literal % or _ you have to opt in by nominating an escape character of your own, eg.:

something LIKE '100=%' ESCAPE '='

For cross-database compatibility, it is best always to use the LIKE...ESCAPE form, and pick something other than the horrible backslash! (Aside - MySQL's backslashes for SQL string literal escaping aren't ANSI conformant either! But you can turn that misbehaviour off with the NO_BACKSLASH_ESCAPES sql_mode setting.)

Probably a better idea would be to break services out into a second table rather than squashing them into a single string column - ie. put your schema in First Normal Form. Then you could get a simple lookup of individual values rather than having to do a slow full-table-scan substring-match.

like image 62
bobince Avatar answered Sep 25 '22 03:09

bobince