In Erlang, when defining a UTF-8 binary string, I need to specify the encoding in the binary literal, like this:
Star = <<"★"/utf8>>.
> <<226,152,133>>
io:format("~ts~n", [Star]).
> ★
> ok
But if the /utf8 encoding is omitted, the Unicode characters are not handled correctly:
Star1 = <<"★">>.
> <<5>>
io:format("~ts~n", [Star1]).
> ^E
> ok
Is there a way that I can create literal binary strings like this without having to specify /utf8 in every binary I create? My code has quite a few binaries like this and things have become quite cluttered. Is there a way to set some sort of default encoding for binaries?
UTF-8 is the de facto universal encoding for internationalization and can represent the entire Unicode character set. It is used pervasively on the web and is the default on *nix-based platforms.
UTF-8 is an encoding scheme for Unicode: it maps any Unicode character to a unique byte sequence and maps that byte sequence back to the character. That is the meaning of "UTF", or "Unicode Transformation Format."
This is a result of the ambiguity of Erlang strings and lists. When you enter <<"★">>, the string literal expands to its list of code points, [9733], and each code point is stored as an integer segment with the default size of 8 bits. Only the low byte of 9733 survives (9733 rem 256 is 5), which is why the result is <<5>>.
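You can check the truncation directly in the shell; a minimal sketch, assuming a UTF-8 source encoding so that "★" reads as the single code point 9733:
<<"★">> =:= <<9733>>.
> true
<<9733>>.
> <<5>>
9733 rem 256.
> 5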
The /utf8 type specifier tells Erlang that the value is a Unicode code point to be encoded as UTF-8, so 9733 is expanded into its three-byte UTF-8 sequence instead of being truncated to a single byte.
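For comparison, the same code point with /utf8 produces the expected three bytes, and the standard library's unicode:characters_to_binary/1 performs the equivalent conversion at run time if you'd rather build such binaries outside the bit syntax:
<<9733/utf8>>.
> <<226,152,133>>
unicode:characters_to_binary("★").
> <<226,152,133>>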