In Elixir, if I have a string such as "José1 José2", how do I enumerate it? If I try to use Enum or for comprehensions, I get the following error:
** (Protocol.UndefinedError) protocol Enumerable not implemented for "José1 José2" of type BitString
Strings in Elixir are UTF-8 encoded binaries. A binary is just a sequence of bytes, so to enumerate one you need to specify how it should be split up.
String.graphemes/1 - this will give you a list of strings, where each string contains an individual Unicode grapheme. This is probably closest to what you mean if you want each "character".
iex> String.graphemes("José1 José2")
["J", "o", "s", "é", "1", " ", "J", "o", "s", "é", "2"]
String.codepoints/1 - this will give you a list of strings broken down by Unicode codepoints. Note that a Unicode codepoint does not necessarily translate to a human-readable character.
iex> String.codepoints("José1 José2")
["J", "o", "s", "é", "1", " ", "J", "o", "s", "e", "́", "2"]
You can see that the first and the second é graphemes are represented differently in terms of Unicode codepoints. The first one is LATIN SMALL LETTER E WITH ACUTE (U+00E9), whereas the second one is LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
This is why you can't simply enumerate a string: when dealing with Unicode, you have to specify whether you are interested in graphemes, codepoints, or something else.
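To make the difference concrete, here is a small sketch that counts the same string three ways. It assumes the string is written with \u escapes so that the two forms of é are unambiguous (precomposed first, decomposed second); the variable names are just for illustration. The String.normalize/2 call at the end is an addition to what is discussed above, shown only to illustrate that the two forms are canonically equivalent.

```elixir
# The sample string, with escapes so the two forms of "é" are explicit:
# precomposed U+00E9 first, then "e" followed by combining U+0301.
s = "Jos\u00E91 Jose\u03012"

# String.length/1 counts graphemes, not codepoints or bytes.
grapheme_count = String.length(s)
codepoint_count = s |> String.codepoints() |> length()
byte_count = byte_size(s)

IO.inspect({grapheme_count, codepoint_count, byte_count})
# => {11, 12, 14}

# NFC normalization composes "e" + U+0301 into the single codepoint U+00E9.
composed = String.normalize("e\u0301", :nfc)
IO.inspect(composed == "\u00E9")
# => true
```

The three counts line up with the lengths of the lists returned by String.graphemes/1, String.codepoints/1 and :binary.bin_to_list/1 for the same string.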
String.to_charlist/1 - gives you a list of the numerical Unicode codepoints of the string. This can be used to interface with Erlang libraries that use this format.
iex> String.to_charlist("José1 José2")
[74, 111, 115, 233, 49, 32, 74, 111, 115, 101, 769, 50]
:binary.bin_to_list/1 - If you just want to enumerate the bytes.
iex> :binary.bin_to_list("José1 José2")
[74, 111, 115, 195, 169, 49, 32, 74, 111, 115, 101, 204, 129, 50]
Once you have a list, you can enumerate it using comprehensions or any of the functions in the Enum module:
iex> for c <- String.graphemes("José1 José2"), into: "", do: c <> c
"JJoosséé11 JJoosséé22"
iex> "José1 José2" |> String.graphemes() |> Enum.join("|")
"J|o|s|é|1| |J|o|s|é|2"
It is also possible to use comprehensions with bitstring generators for enumerating the bytes and codepoints (but not the graphemes).
Equivalent to :binary.bin_to_list/1:
iex> for <<byte <- "José1 José2">>, do: byte
[74, 111, 115, 195, 169, 49, 32, 74, 111, 115, 101, 204, 129, 50]
Equivalent to String.to_charlist/1, by specifying that the segment type is utf8:
iex> for <<cp::utf8 <- "José1 José2">>, do: cp
[74, 111, 115, 233, 49, 32, 74, 111, 115, 101, 769, 50]
Equivalent to String.codepoints/1, by specifying that the segment type is utf8 and converting the resulting codepoints back to UTF-8 binaries:
iex> for <<cp::utf8 <- "José1 José2">>, do: <<cp::utf8>>
["J", "o", "s", "é", "1", " ", "J", "o", "s", "e", "́", "2"]
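Bitstring generators can't produce graphemes, but if you want to walk the graphemes lazily instead of building the whole list up front, one possible approach is to combine String.next_grapheme/1 with Stream.unfold/2. This is only a sketch of that idea, not something the comprehension syntax gives you directly; the variable names are illustrative.

```elixir
# Same sample string as above, with the two forms of "é" written as escapes.
s = "Jos\u00E91 Jose\u03012"

# String.next_grapheme/1 returns {grapheme, rest}, or nil at the end of the
# string, which is the shape Stream.unfold/2 expects from its function.
graphemes = Stream.unfold(s, &String.next_grapheme/1)

# Only the first four graphemes are computed here.
first_four = Enum.take(graphemes, 4)
IO.inspect(first_four)
# => ["J", "o", "s", "é"]

# Consuming the whole stream gives the same list as String.graphemes/1.
IO.inspect(Enum.to_list(graphemes) == String.graphemes(s))
# => true
```

Because the result is a Stream, it works with everything in the Enum and Stream modules, and nothing past the graphemes you actually consume is computed.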
P.S. For further reading about character encodings, this blog post from 2003 is great: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).