
Why does String.replace with character classes convert German umlauts to Binaries?

Tags:

elixir

(Apologies if my terminology regarding Binaries is off - I'm still getting started with Elixir)

While solving one of the Exercism questions for Elixir, I noticed that String.replace as well as Regex.replace apparently convert German Umlauts to Binaries when using the [:alnum:] character class:

iex(1)> String.replace("ö", ~r/[[:alnum:]]/, "_")
<<95, 182>>
iex(2)> String.replace("ö", ~r/[^[:alnum:]]/, "_")
<<195, 95>>
iex(3)> String.replace("ö", ~r/[_]/, " ")
"ö"

Is this behaviour caused by my usage of the [:alnum:] character class? (what really baffles me is that both the first and the second version return a Binary)

Frank Schmitt asked Feb 25 '26 05:02



1 Answer

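You need to pass the u modifier to the Regex so that [:alnum:] and other such patterns match on Unicode strings. Without it, the regex works on raw bytes. "ö" is two bytes in UTF-8 (195 and 182), and each of your patterns replaces only one of them with "_" (byte 95). What is left is not valid UTF-8, so IEx prints it as a binary instead of a string. That is why both the first and the second version return a binary.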

iex(1)> String.replace("ö", ~r/[[:alnum:]]/u, "_")
"_"
iex(2)> String.replace("ö", ~r/[^[:alnum:]]/u, "_")
"ö"
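
To see where the bytes in the question come from, here is a minimal sketch. It assumes the source file is UTF-8 and that "ö" is the single precomposed code point U+00F6; the binary <<95, 182>> is copied from the question's output rather than recomputed.

```elixir
# "ö" is two bytes in UTF-8.
bytes = :binary.bin_to_list("ö")
IO.inspect(bytes)                       # [195, 182]

# The result from the question, <<95, 182>>, is "_" (byte 95) followed by the
# leftover second byte of "ö". That is not valid UTF-8, so IEx shows raw bytes.
IO.inspect(String.valid?(<<95, 182>>))  # false

# With the u modifier the regex sees one code point and replaces it as a whole.
with_u = String.replace("ö", ~r/[[:alnum:]]/u, "_")
IO.inspect(with_u)                      # "_"
```

String.valid?/1 is a quick way to check whether a replacement has split a multi-byte character.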

From h Regex:

Modifiers

The modifiers available when creating a Regex are:

  • unicode (u) - enables unicode specific patterns like \p and change modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match

    ...
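
As a small illustration of the modifier described above (a sketch, not taken from the Regex docs), the same anchored pattern can be tried with and without u. The anchors ^ and $ are only there to force the pattern to cover the whole string.

```elixir
# Without u, \w matches a single byte, so it cannot cover the two-byte "ö".
no_u = Regex.match?(~r/^\w$/, "ö")
IO.inspect(no_u)      # false

# With u, \w matches one Unicode word character.
with_u = Regex.match?(~r/^\w$/u, "ö")
IO.inspect(with_u)    # true

# \p{L} (any Unicode letter) is one of the unicode specific patterns.
letter = Regex.match?(~r/^\p{L}$/u, "ö")
IO.inspect(letter)    # true
```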

Dogbert answered Mar 01 '26 03:03
