Unicode and :alpha:

Question

Why is this false:

iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false

But this is true?

iex(2)> String.match?("汉语漢語", ~r/[[:alpha:]]/)
true

Is [:alpha:] sometimes Unicode-aware and sometimes not?

EDIT:

I don't think my original example was clear enough.

Why is this false:

iex(1)> String.match?("汉", ~r/^[[:alpha:]]+$/)
false

But this is true?

iex(2)> String.match?("汉", ~r/[[:alpha:]]/)
true

Wiktor Stribiżew · Accepted Answer

When you pass the string to the regex in non-Unicode mode, it is treated as a sequence of bytes, not as a Unicode string. Compare IO.puts byte_size("汉语漢語") (12, the bytes that make up the input: 230,177,137,232,175,173,230,188,162,232,170,158) with IO.puts String.length("汉语漢語") (4, the Unicode "letters"). Some of those bytes cannot be matched by the [:alpha:] POSIX character class. Thus, the first expression fails, because ^[[:alpha:]]+$ requires every byte to match, while the second succeeds, because it only needs a single matching byte anywhere in the string.
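The byte-versus-character difference described above can be checked directly. This is a minimal sketch, assuming a UTF-8 source file or iex session; the variable name s is only for illustration.

```elixir
s = "汉语漢語"

# Size of the UTF-8 encoding in bytes
IO.inspect(byte_size(s))
# → 12

# Number of characters (graphemes) as the String module counts them
IO.inspect(String.length(s))
# → 4

# The raw bytes that a non-Unicode regex is matched against
IO.inspect(:binary.bin_to_list(s))
# → [230, 177, 137, 232, 175, 173, 230, 188, 162, 232, 170, 158]

# Without the u modifier: the anchored pattern must accept every byte,
# while the unanchored pattern only needs one acceptable byte.
IO.inspect(String.match?("汉", ~r/^[[:alpha:]]+$/))
# → false
IO.inspect(String.match?("汉", ~r/[[:alpha:]]/))
# → true
```

The last two results are the ones reported in the question; the byte list shows why a per-byte match can succeed on one byte and still fail on the whole string.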

To match Unicode strings properly with the PCRE regex library (which Elixir uses), you need to enable Unicode mode with the /u modifier:

IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)

The IDEONE demo prints true.

See the Elixir regex reference:

unicode (u) - enables unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match.
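As a quick check of what the quoted reference describes, the sketch below runs the anchored pattern with the u modifier, plus a \p{L} ("any letter") variant. The \p{L} patterns and the variable names are additions for illustration and are not part of the original answer.

```elixir
s = "汉语漢語"

# With the u modifier, [:alpha:] is matched against code points, not bytes
with_u = String.match?(s, ~r/^[[:alpha:]]+$/u)
IO.inspect(with_u)
# → true

# \p{L} matches any Unicode letter and is enabled by the u modifier
letters = String.match?(s, ~r/^\p{L}+$/u)
IO.inspect(letters)
# → true

# A space or digit is not a letter, so the anchored match fails
mixed = String.match?("汉语 123", ~r/^\p{L}+$/u)
IO.inspect(mixed)
# → false
```

Using \p{L} makes the intent ("letters in any script") explicit, which may be preferable when the pattern is always compiled with u.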
