
Must UTF-8 binaries include /utf8 in the binary literal in Erlang?

In Erlang, when defining a UTF-8 binary string, I need to specify the encoding in the binary literal, like this:

Star = <<"★"/utf8>>.
> <<226,152,133>>
io:format("~ts~n", [Star]).
> ★
> ok

But if the /utf8 encoding is omitted, the Unicode characters are not handled correctly:

Star1 = <<"★">>.
> <<5>>
io:format("~ts~n", [Star1]).
> ^E
> ok

Is there a way that I can create literal binary strings like this without having to specify /utf8 in every binary I create? My code has quite a few binaries like this and things have become quite cluttered. Is there a way to set some sort of default encoding for binaries?

Stratus3D asked Jun 19 '14 20:06




1 Answer

This is a result of how Erlang treats string literals inside binaries. When you enter <<"★">>, Erlang treats the string as shorthand for its code points, so what it actually sees is <<9733>>, a single integer segment. A segment with no type specifier defaults to an 8-bit unsigned integer, so 9733 is truncated to its low 8 bits (9733 rem 256 = 5), which is why you get <<5>> rather than a multi-byte encoding.
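The <<5>> result from the question can be reproduced directly from the code point. The following is a minimal sketch; the module name truncation_demo and the function run/0 are hypothetical names chosen for illustration, and the code point is written in hex (16#2605 = 9733) so the file contains only ASCII.

```erlang
-module(truncation_demo).
-export([run/0]).

%% Hypothetical demo module; the names are illustrative only.
run() ->
    %% 16#2605 (decimal 9733) is the code point of the star character.
    CodePoint = 16#2605,
    %% A segment with no type specifier is an 8-bit integer,
    %% so only the low 8 bits of the value are kept.
    Truncated = <<CodePoint>>,
    %% Spelling out the default size and type gives the same binary.
    Explicit = <<CodePoint:8/integer>>,
    %% The same low byte, computed by hand.
    LowByte = CodePoint band 16#FF,
    {Truncated, Explicit, LowByte}.
```

Calling truncation_demo:run() should return both binaries equal to <<5>>, matching the Star1 output in the question.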

The /utf8 type specifier tells Erlang that each integer in the segment is a Unicode code point to be encoded as UTF-8, so the code point 9733 is written out as the three bytes <<226,152,133>> instead of being treated as a plain integer.
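As a rough sketch of what the utf8 type does, the snippet below encodes the same code point two ways. The module name utf8_demo and the function run/0 are hypothetical; unicode:characters_to_binary/1 is a standard library function that converts a list of code points to a UTF-8 binary by default.

```erlang
-module(utf8_demo).
-export([run/0]).

%% Hypothetical demo module; the names are illustrative only.
run() ->
    %% 16#2605 (decimal 9733) is the code point of the star character.
    CodePoint = 16#2605,
    %% The utf8 type encodes the code point as UTF-8 bytes.
    Encoded = <<CodePoint/utf8>>,
    %% Converting a list of code points yields the same bytes.
    Converted = unicode:characters_to_binary([CodePoint]),
    {Encoded, Converted, byte_size(Encoded)}.
```

Both values should be the three-byte binary <<226,152,133>> shown for Star in the question. Wrapping a string in unicode:characters_to_binary/1 is one possible alternative to writing /utf8 on each literal, though it does its work at runtime, so whether it reduces clutter is a judgment call.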

Soup d'Campbells answered Oct 08 '22 19:10