
Unicode and :alpha:

Tags: regex, elixir

Why is this false:

iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false

But this is true?:

iex(2)> String.match?("汉语漢語", ~r/[[:alpha:]]/)
true

Sometimes [:alpha:] is unicode and sometimes it's not?

EDIT:

I don't think my original example was clear enough.

Why is this false:

iex(1)> String.match?("汉", ~r/^[[:alpha:]]+$/)
false

But this is true?:

iex(2)> String.match?("汉", ~r/[[:alpha:]]/)
true
mwoods79 asked Nov 07 '15 18:11



1 Answer

When the regex is compiled without the Unicode modifier, the subject string is treated as a sequence of bytes, not as a Unicode string. Compare byte_size("汉语漢語"), which returns 12 (the bytes 230, 177, 137, 232, 175, 173, 230, 188, 162, 232, 170, 158), with String.length("汉语漢語"), which returns 4 (the Unicode "letters"). Some of those bytes cannot be matched by the [:alpha:] POSIX character class. The first expression requires everything between ^ and $ to match, so it returns false. The second expression returns true because it only needs a single matching byte anywhere in the string.
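The byte versus character difference described above can be checked directly in a script or in iex. This is a minimal sketch; the variable name s is only for illustration.

```elixir
s = "汉语漢語"

# Size of the UTF-8 encoding in bytes.
IO.inspect(byte_size(s))      # 12

# Number of Unicode graphemes.
IO.inspect(String.length(s))  # 4

# The raw bytes that a regex without the u modifier works on.
IO.inspect(:binary.bin_to_list(s))
```

Each of the four characters takes three bytes in UTF-8, which accounts for the 12 bytes.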

To match Unicode strings properly with the PCRE regex library that Elixir uses, enable Unicode mode with the u modifier:

IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)

This prints true (see the IDEONE demo).

The Elixir Regex reference describes the modifier as follows:

unicode (u) - enables unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match.
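As a side-by-side sketch of the behaviour described in this answer, the snippet below runs the anchored pattern with and without the u modifier. The \p{L} variant (any Unicode letter) is an addition not taken from the original answer, shown as one possible alternative once Unicode mode is on. The variable names are illustrative.

```elixir
s = "汉语漢語"

# Without u: the pattern is applied to bytes, so the anchored match fails.
without_u = String.match?(s, ~r/^[[:alpha:]]+$/)

# With u: the pattern is applied to Unicode code points.
with_u = String.match?(s, ~r/^[[:alpha:]]+$/u)

# \p{L} matches any Unicode letter; it also needs the u modifier here.
with_prop = String.match?(s, ~r/^\p{L}+$/u)

IO.inspect({without_u, with_u, with_prop})  # {false, true, true}
```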

Wiktor Stribiżew answered Sep 24 '22 10:09
