I'm in the process of learning Elixir and came across something that didn't make sense to me...
I'm trying to remove punctuation:
"Freude schöner Götterfunken" |> String.replace(~r/[^\s\w]/, "") #=> <<70, 114, 101, 117, 100, 101, 32, 115, 99, 104, 195, 110, 101, 114, 32, 71, 195, 116, 116, 101, 114, 102, 117, 110, 107, 101, 110>>
"Freude schöner Götterfunken" |> String.replace(~r/[^\w]/, "") #=> <<70, 114, 101, 117, 100, 101, 115, 99, 104, 195, 110, 101, 114, 71, 195, 116, 116, 101, 114, 102, 117, 110, 107, 101, 110>>
"Freude schöner Götterfunken" |> String.replace(~r/\p{P}/, "") #=> <<70, 114, 101, 117, 100, 101, 32, 115, 99, 104, 195, 110, 101, 114, 32, 71, 195, 116, 116, 101, 114, 102, 117, 110, 107, 101, 110>>
"Freude schöner Götterfunken" |> String.replace(~r/\s/, "") #=> "FreudeschönerGötterfunken"
"Hi my name is bob" |> String.replace(~r/\w/, "") #=> "    "
Regex.run(~r/[^\w]/, "Freude schöner Götterfunken") #=> [<<182>>]
This seems like a bug, but being a noob I'm assuming ignorance. Why isn't the replace returning the string?
You are right that String.replace/3 isn't returning a string here, since Elixir defines strings as UTF-8 encoded binaries. However, this is not a bug: Elixir expects you to pass valid arguments and perform valid operations on them, and it won't verify every result because doing so would be expensive.
For example, if you pass any of the binaries above to String.downcase/1, Elixir will downcase the parts it knows about and ignore the rest. This works because UTF-8 is self-synchronizing: when we see something unexpected, we can skip the offending byte and continue the operation.
In other words, the philosophy of string handling in Elixir is to validate at the boundaries (when opening files, doing I/O or reading from a database) and to assume that we are working with valid data and performing valid operations everywhere else.
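As a rough illustration of what validating at the boundary could look like, here is a minimal sketch built on String.valid?/1. The BoundaryCheck module and its ingest/1 function are hypothetical names made up for this example, not part of Elixir.

```elixir
defmodule BoundaryCheck do
  # Hypothetical helper: accept external data only if it is valid UTF-8.
  def ingest(data) when is_binary(data) do
    if String.valid?(data) do
      {:ok, data}
    else
      {:error, :invalid_utf8}
    end
  end
end

BoundaryCheck.ingest("Götterfunken")
# {:ok, "Götterfunken"}

# 195 is only the first byte of the two-byte encoding of "ö",
# so this binary is not valid UTF-8.
BoundaryCheck.ingest(<<71, 195, 116>>)
# {:error, :invalid_utf8}
```

Once data has passed a check like this, the rest of the code can treat it as a string without re-validating after every operation.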
OK, with all that said, why does your code not work? The reason is that your regex does not have Unicode enabled. Let's add the u modifier then:
iex> "Freude schöner Götterfunken" |> String.replace(~r/[^\s\w]/u, "")
"Freude schöner Götterfunken"
Well, it doesn't solve your problem, but at least the result is valid. Reading about Unicode categories shows that we can't really solve this problem with Unicode properties, because ö in your example is a single codepoint which matches the \p{L} property.
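To see how the u modifier classifies these characters, here is a small sketch. The Punctuation module is a hypothetical name for illustration, the input with a comma and an exclamation mark is made up, and ö is written as the escape \u00F6 so that it is certainly the single precomposed codepoint.

```elixir
# "ö" as one precomposed codepoint (U+00F6) is a letter, not punctuation.
Regex.match?(~r/^\p{L}$/u, "\u00F6")
# true
Regex.match?(~r/^\p{P}$/u, "\u00F6")
# false

defmodule Punctuation do
  # Hypothetical helper: strip Unicode punctuation and keep letters such as ö.
  def strip(text) do
    String.replace(text, ~r/\p{P}/u, "")
  end
end

Punctuation.strip("Freude, sch\u00F6ner G\u00F6tterfunken!")
# "Freude schöner Götterfunken"
```

With Unicode enabled the regex works on codepoints rather than bytes, so a multi-byte character is either kept or removed as a whole and the result stays a valid string.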
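Maybe the simplest solution in this case, assuming you only want to solve it for German, is to just traverse the binary and keep the bytes <= 127. Every byte of a multi-byte UTF-8 character is above 127, so this drops each non-ASCII character as a whole. Something like: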
iex> for <<x <- "Freude schöner Götterfunken">>, x <= 127, into: "", do: <<x>>
"Freude schner Gtterfunken"
If you want a more complete solution, you should probably look into Unicode transliteration.
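Short of full transliteration, one common approximation is to decompose the string with String.normalize/2 and then drop the combining marks. This is only a sketch of diacritic stripping, not real transliteration (German convention would turn ö into oe, and letters without a decomposition such as ß are left untouched). The Diacritics module is a hypothetical name for this example.

```elixir
defmodule Diacritics do
  # Hypothetical helper: decompose to NFD so that "ö" becomes "o" followed by
  # a combining diaeresis, then remove the combining marks (category Mn).
  def strip(text) do
    text
    |> String.normalize(:nfd)
    |> String.replace(~r/\p{Mn}/u, "")
  end
end

Diacritics.strip("Freude sch\u00F6ner G\u00F6tterfunken")
# "Freude schoner Gotterfunken"
```

Unlike the byte filter, this keeps the base letter, which is often closer to what a reader expects.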