How can I match capital ÅÄÖ in mysql regexp

Question

When I do a REGEXP comparison in MySQL, I get some strange results for the capital versions of the Swedish characters. I am using the utf8_swedish_ci collation and I want to find capitalized words.

SELECT 'Öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' should return 1 and SELECT 'öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' should return 0, but I get the opposite result.

SELECT 'Öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 0 (incorrect)
SELECT 'öster' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 1 (incorrect)
SELECT 'Söder' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 1 (correct)
SELECT 'söder' REGEXP BINARY '^[A-ZÅÄÖ][a-zåäö]+$' # returns 0 (correct)

If I use REGEXP instead of REGEXP BINARY, 'söder' will also match (which is not what I want), but even then 'Öster' is not a match.

What should I do about this?

DPenner1 · Accepted Answer

I realize you've found a fix, but wanted to explain why it works. REGEXP in MySQL doesn't work with "characters" but works with bytes. Å, Ä, Ö, å, ä, and ö are all two byte characters in UTF-8. When they are used in the regex [ ] construct, the regex engine sees each of these bytes individually and only attempts to match one byte rather than the two bytes that composes the whole character. If you decompose these characters into their constituent bytes, you can see why some matches happened by fluke.

Your fix of using the regex '^([A-Z]|Å|Ä|Ö)[a-zåäö]+$' technically works, but it's by chance that the bytes which compose å, ä, and ö don't actually allow any other unintended well-formed UTF-8 character strings to accidentally match.

I would recommend using '^([A-Z]|Å|Ä|Ö)([a-z]|å|ä|ö)+$' for clarity.

How can I match capital ÅÄÖ in mysql regexp

Tags:

regex

mysql

unicode

Snurolss

1 Answers

DPenner1

Recent Activity

Donate For Us

How can I match capital ÅÄÖ in mysql regexp

Tags:

regex

mysql

unicode

Snurolss

1 Answers

DPenner1

Related questions

Recent Activity

Donate For Us