
Locale specific behavior in the regex library?

When I imbue a regex object with a particular locale, how does it affect the matching behavior? Does it affect collation, or anything else? I can't seem to find an explanation anywhere.

Alex Korban asked Jan 28 '12 07:01



2 Answers

It affects at least the following:

  • Collation: the range [a-f] imbued with a French locale should match the character é, because é collates between e and f there. (If the library in question is C++11 <regex>, ranges are only locale-sensitive when the pattern is compiled with the collate flag.)
  • Similarly, \w in a Finnish locale should match the character ä. The range [a-z] should not, because å, ä and ö collate after z in Finnish. In German, however, [a-z] should match ä.
  • In a Unicode-compatible locale, the Unicode equivalence algorithm should be used, so that the composed form of a character matches its decomposed form and vice versa.
  • With a POSIX-compatible regex flavor (basic, extended, awk, grep, and egrep), the POSIX bracket-expression classes should be locale-aware: the equivalence class [=e=], written [[=e=]] in a pattern, should match é in a French locale but not in an English locale.
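To try these cases yourself, here is a minimal sketch assuming the library is C++11 <regex>. The helper names locale_or_classic and matches_in_locale are made up for illustration. One detail worth knowing: imbue() discards any pattern the regex object already holds, so the locale has to be set before the pattern is assigned.

```cpp
#include <locale>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: return the named locale if the system provides it,
// otherwise fall back to the classic "C" locale.
std::locale locale_or_classic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Hypothetical helper: compile `pattern` under `loc`, then test whether it
// matches the whole of `text`.
bool matches_in_locale(const std::string& pattern,
                       const std::string& text,
                       const std::locale& loc,
                       std::regex::flag_type flags = std::regex::ECMAScript) {
    std::regex re;
    re.imbue(loc);              // set the locale first: imbue() drops any compiled pattern
    re.assign(pattern, flags);  // compile the pattern using the imbued locale
    return std::regex_match(text, re);
}
```

For the [a-f] case you would pass something like locale_or_classic("fr_FR.ISO-8859-1") together with std::regex::ECMAScript | std::regex::collate. The locale name is an assumption (names are platform-specific), and whether é actually matches depends on the installed locale data and on the standard library implementation. Also note that std::regex over char works one char at a time, so a UTF-8 é (two bytes) will not behave as a single character; a single-byte encoding or std::wregex is needed for that test.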
JohannesD answered Oct 08 '22 07:10



On the Spanish locale, please note that "ch" and "ll" are no longer considered single letters of the alphabet; the bodies that regulate the language changed this. I could not find the exact date, but right now "ch" and "ll" are each two letters:

http://en.wikipedia.org/wiki/Ll

I think implementations now reflect that fact.

Tom K answered Oct 08 '22 06:10
