(Apologies if my terminology regarding Binaries is off - I'm still getting started with Elixir)
While solving one of the Exercism questions for Elixir, I noticed that String.replace as well as Regex.replace apparently convert German Umlauts to Binaries when using the [:alnum:] character class:
iex(1)> String.replace("ö", ~r/[[:alnum:]]/, "_")
<<95, 182>>
iex(2)> String.replace("ö", ~r/[^[:alnum:]]/, "_")
<<195, 95>>
iex(3)> String.replace("ö", ~r/[_]/, " ")
"ö"
Is this behaviour caused by my usage of the [:alnum:] character class? (what really baffles me is that both the first and the second version return a Binary)
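You need to pass the u modifier to the Regex so that [:alnum:] and other such patterns match on Unicode strings. Without it, the regex works byte by byte: "ö" is two bytes in UTF-8 (<<195, 182>>), and each of your patterns matches only one of them. Replacing a single byte leaves a binary that is no longer valid UTF-8, which is why iex prints it as raw bytes instead of a string.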
iex(1)> String.replace("ö", ~r/[[:alnum:]]/u, "_")
"_"
iex(2)> String.replace("ö", ~r/[^[:alnum:]]/u, "_")
"ö"
From h Regex:
Modifiers
The modifiers available when creating a Regex are:
unicode (u) - enables unicode specific patterns like \p and change modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match
...
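To see why the results in the question showed up as raw bytes, here is a small sketch you could paste into iex or run as a script. It only uses the "ö" input and the <<95, 182>> result from the question; the variable names are just for illustration.

```elixir
# "ö" is one character but two bytes when encoded as UTF-8.
umlaut = "ö"
bytes = :binary.bin_to_list(umlaut)
IO.inspect(bytes, label: "bytes of ö")
IO.inspect(String.length(umlaut), label: "characters")

# The result from the question, <<95, 182>>, is "_" followed by the
# second byte of "ö". It is still a binary, but not valid UTF-8,
# so iex cannot print it as a string.
IO.inspect(String.valid?(<<95, 182>>), label: "valid UTF-8?")

# With the u modifier the regex matches whole code points,
# so the complete character is replaced.
replaced = String.replace(umlaut, ~r/[[:alnum:]]/u, "_")
IO.inspect(replaced, label: "with u modifier")
```

Every Elixir string is a binary; what changes is only whether the bytes form valid UTF-8, which String.valid?/1 checks.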