When I do a REGEXP comparison in MySQL, I get some strange results for the capital versions of the Swedish characters. I am using the utf8_swedish_ci collation and I want to find capitalized words.
SELECT 'Öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$'
should return 1 and SELECT 'öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$'
should return 0, but I get the opposite result.
SELECT 'Öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 0 (incorrect)
SELECT 'öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 1 (incorrect)
SELECT 'Söder' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 1 (correct)
SELECT 'söder' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 0 (correct)
If I use REGEXP instead of REGEXP BINARY, 'söder' will also match (which is not what I want), but even then 'Öster' is not a match.
What should I do about this?
I realize you've found a fix, but wanted to explain why it works. REGEXP
in MySQL doesn't work with "characters" but works with bytes. Å, Ä, Ö, å, ä, and ö are all two byte characters in UTF-8. When they are used in the regex [ ]
construct, the regex engine sees each of these bytes individually and only attempts to match one byte rather than the two bytes that composes the whole character. If you decompose these characters into their constituent bytes, you can see why some matches happened by fluke.
Your fix of using the regex '^([A-Z]|Å|Ä|Ö)[a-zåäö]+$'
technically works, but it's by chance that the bytes which compose å, ä, and ö don't actually allow any other unintended well-formed UTF-8 character strings to accidentally match.
I would recommend using '^([A-Z]|Å|Ä|Ö)([a-z]|å|ä|ö)+$'
for clarity.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With