Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I match capital ÅÄÖ in mysql regexp

When I do a REGEXP comparison in MySQL, I get some strange results for the capital versions of the Swedish characters. I am using the utf8_swedish_ci collation and I want to find capitalized words.

SELECT 'Öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' should return 1 and SELECT 'öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' should return 0, but I get the opposite result.

SELECT 'Öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 0 (incorrect)
SELECT 'öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 1 (incorrect)
SELECT 'Söder' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 1 (correct)
SELECT 'söder' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 0 (correct)

If I use REGEXP instead of REGEXP BINARY, 'söder' will also match (which is not what I want), but even then 'Öster' is not a match.

What should I do about this?

like image 713
Snurolss Avatar asked Oct 21 '22 00:10

Snurolss


1 Answers

I realize you've found a fix, but wanted to explain why it works. REGEXP in MySQL doesn't work with "characters" but works with bytes. Å, Ä, Ö, å, ä, and ö are all two byte characters in UTF-8. When they are used in the regex [ ] construct, the regex engine sees each of these bytes individually and only attempts to match one byte rather than the two bytes that composes the whole character. If you decompose these characters into their constituent bytes, you can see why some matches happened by fluke.

Your fix of using the regex '^([A-Z]|Å|Ä|Ö)[a-zåäö]+$' technically works, but it's by chance that the bytes which compose å, ä, and ö don't actually allow any other unintended well-formed UTF-8 character strings to accidentally match.

I would recommend using '^([A-Z]|Å|Ä|Ö)([a-z]|å|ä|ö)+$' for clarity.

like image 99
DPenner1 Avatar answered Oct 31 '22 17:10

DPenner1