Why is this false:
iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false
But this is true?
iex(2)> String.match?("汉语漢語", ~r/[[:alpha:]]/)
true
So sometimes [:alpha:] is Unicode-aware and sometimes it's not?
I don't think my original example was clear enough.
Why is this false:
iex(1)> String.match?("汉", ~r/^[[:alpha:]]+$/)
false
But this is true?
iex(2)> String.match?("汉", ~r/[[:alpha:]]/)
true
When you pass the string to the regex in non-Unicode mode, it is treated as a sequence of bytes, not as a Unicode string. Compare IO.puts byte_size("汉语漢語") (12, all the bytes the input consists of: 230, 177, 137, 232, 175, 173, 230, 188, 162, 232, 170, 158) with IO.puts String.length("汉语漢語") (4, the Unicode "letters"). Some of those bytes cannot be matched by the [:alpha:] POSIX character class. Thus, the first expression fails, because ^[[:alpha:]]+$ requires every byte from start to end to match, while the second succeeds, because it only needs a single matching byte anywhere in the string.
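The byte-versus-character difference described above can be checked directly. The following is a minimal sketch; the variable names are only illustrative, and the last call is left without an expected result because it depends on how the regex engine classifies individual bytes when the u modifier is absent.

```elixir
s = "汉语漢語"

# Number of bytes in the UTF-8 encoded binary.
byte_count = byte_size(s)

# Number of Unicode graphemes (the "letters" a reader sees).
char_count = String.length(s)

# The raw bytes the regex engine sees when Unicode mode is off.
bytes = :binary.bin_to_list(s)

IO.inspect(byte_count)  # 12
IO.inspect(char_count)  # 4
IO.inspect(bytes)       # [230, 177, 137, 232, 175, 173, 230, 188, 162, 232, 170, 158]

# Shows which part of "汉" the non-Unicode pattern actually matched,
# as a list of {byte_offset, byte_length} tuples.
IO.inspect(Regex.run(~r/[[:alpha:]]/, "汉", return: :index))
```

If the reported match length is shorter than the 3 bytes that encode "汉", the pattern matched only part of the character, which is consistent with the explanation above.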
To properly match Unicode strings with the PCRE regex library (which Elixir uses), you need to enable Unicode mode with the u modifier, written after the closing delimiter of the pattern:
IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)
See the IDEONE demo (prints true).
See the Elixir regex reference:
unicode (u) - enables unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match.
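To see the effect of the u modifier side by side, here is a small sketch that runs the same anchored pattern with and without it, plus a \p{L} variant (the "any Unicode letter" property, one of the \p patterns the reference mentions). The variable names are illustrative only.

```elixir
s = "汉语漢語"

# Without the u modifier the subject is matched byte by byte,
# so the anchored pattern fails on the multi-byte characters.
bytewise = String.match?(s, ~r/^[[:alpha:]]+$/)

# With the u modifier the subject is treated as Unicode code points.
unicode_posix = String.match?(s, ~r/^[[:alpha:]]+$/u)

# \p{L} matches any Unicode letter; it is used here together with u.
unicode_prop = String.match?(s, ~r/^\p{L}+$/u)

IO.inspect({bytewise, unicode_posix, unicode_prop})  # {false, true, true}
```

Using \p{L} instead of [[:alpha:]] is a stylistic choice here; both forms match these letters once Unicode mode is on.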