I have downloaded and installed Camomile, so I am ready to use it. The question is: how should I use it?
In OCaml, for a default string, I just write
let s = "a string";;
but what about with Camomile? For example, if I want to construct the UTF-8 string こんにちは (Japanese for "hello", copied from Google Translate), how should I do it with Camomile?
Edit:
It is funny that people say OCaml can't support UTF-8, but I tried this code:
let s = "你好";;
let _ = print_string s; print_string "\n";;
and it worked in OCaml. But why? 你好 is Chinese, so how can OCaml print and handle it if everyone says OCaml 4.00.1 cannot handle UTF-8?
UTF-8 is an 8-bit variable-width encoding. The first 128 Unicode characters, when encoded in UTF-8, have the same byte representation as in ASCII.
UTF-8 encodes a character as a sequence of one, two, three, or four bytes. UTF-16 encodes a Unicode character as either two or four bytes. This distinction is reflected in their names: in UTF-8, the smallest unit of a character's representation is one byte, or eight bits.
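As a concrete illustration of the one-to-four-byte scheme, here is a minimal hand-rolled encoder in plain OCaml (stdlib only, no Camomile; the function name utf8_encode is my own, and a real program would use a library rather than this sketch):

```ocaml
(* Sketch: turn a Unicode code point into its 1-4 byte UTF-8 encoding. *)
let utf8_encode cp =
  let buf = Buffer.create 4 in
  if cp < 0x80 then
    (* 1 byte: 0xxxxxxx *)
    Buffer.add_char buf (Char.chr cp)
  else if cp < 0x800 then begin
    (* 2 bytes: 110xxxxx 10xxxxxx *)
    Buffer.add_char buf (Char.chr (0xC0 lor (cp lsr 6)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end else if cp < 0x10000 then begin
    (* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
    Buffer.add_char buf (Char.chr (0xE0 lor (cp lsr 12)));
    Buffer.add_char buf (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end else begin
    (* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx *)
    Buffer.add_char buf (Char.chr (0xF0 lor (cp lsr 18)));
    Buffer.add_char buf (Char.chr (0x80 lor ((cp lsr 12) land 0x3F)));
    Buffer.add_char buf (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end;
  Buffer.contents buf

(* U+4F60 is 你; its UTF-8 encoding is the three bytes E4 BD A0. *)
let () = assert (utf8_encode 0x4F60 = "\xE4\xBD\xA0")
```

This ignores error cases (surrogates, out-of-range code points), which a real library handles for you.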
Within an identifier, you would also want to allow bytes >= 0x80, the range where UTF-8 lead and continuation bytes live. Most C string library routines still work with UTF-8, since they only scan for the terminating NUL character.
Valid UTF-8 has a specific binary format. A single-byte UTF-8 character is always of the form 0xxxxxxx, where x is any binary digit. A two-byte UTF-8 character is always of the form 110xxxxx 10xxxxxx.
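The bit patterns above can be checked mechanically. Here is a small sketch in plain OCaml (the function name classify is my own) that buckets a byte by those patterns:

```ocaml
(* Sketch: classify a byte according to the UTF-8 bit patterns. *)
let classify b =
  if b land 0x80 = 0x00 then "ASCII (0xxxxxxx)"
  else if b land 0xC0 = 0x80 then "continuation (10xxxxxx)"
  else if b land 0xE0 = 0xC0 then "2-byte lead (110xxxxx)"
  else if b land 0xF0 = 0xE0 then "3-byte lead (1110xxxx)"
  else if b land 0xF8 = 0xF0 then "4-byte lead (11110xxx)"
  else "invalid in UTF-8"

(* The first byte of "你" is 0xE4, a 3-byte lead. *)
let () = print_endline (classify (Char.code "你".[0]))
```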
Here is a short presentation of the different actors:
ASCII is both a set of characters (there are 128 of them) and a code to represent them (on 7 bits).
Unicode is a set of characters (there are a lot more than 127).
UTF-8 is a code to represent unicode characters.
Your terminal. It interprets bytes output by your program as UTF-8 encoded characters and displays the corresponding unicode characters.
OCaml processes sequences of bytes (OCaml uses the name char, but it is misleading and the name byte would be more appropriate). So if OCaml outputs the sequence of bytes corresponding to the UTF-8 encoding of "你好", your terminal will interpret it as a UTF-8 string and will display 你好. But for OCaml, "你好" is just a sequence of 6 bytes.
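This is easy to check directly: String.length counts bytes, not characters, so the two-character string comes out as 6.

```ocaml
(* "你好" is two characters, but each is three UTF-8 bytes. *)
let s = "你好";;
print_string s; print_newline ();;               (* the terminal decodes the bytes *)
print_int (String.length s); print_newline ();;  (* prints 6, not 2 *)
```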
TörökEdwin told you everything you need to know, I think. UTF-8 is specifically designed as a way to store Unicode values (code points) in a series of 8-bit bytes, for code that is used to dealing with ASCII C strings. Since OCaml strings are sequences of 8-bit bytes, there is no problem storing a UTF-8 value there. If the program you use to create your OCaml source handles UTF-8, it will have no trouble producing a string containing a UTF-8 value. You don't need to do anything special to make that happen. (As I said, I've done this many times myself.)
If you don't need to process the value, the OCaml I/O functions can also write out such a value (or read one in), and if your display's encoding is UTF-8 (which is what I use), it will display correctly. But most often you will need to process your values. If you change your code to, for example, write out the length of the string, you might start to see why you need a special library for handling UTF-8.
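To make the length point concrete, here is a hedged, stdlib-only sketch (not Camomile's API; the function name utf8_length is my own) that counts characters by skipping continuation bytes, which works only if the input is already valid UTF-8:

```ocaml
(* Sketch: count UTF-8 characters by counting bytes that are NOT
   continuation bytes (continuation bytes have the form 10xxxxxx).
   Assumes the string is valid UTF-8; a real program would use a
   library such as Camomile instead of this. *)
let utf8_length s =
  let n = ref 0 in
  String.iter
    (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n)
    s;
  !n

let () =
  Printf.printf "bytes=%d chars=%d\n"
    (String.length "你好") (utf8_length "你好")
(* bytes=6 chars=2 *)
```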
If you wonder why a certain Unicode string is represented as a certain series of bytes in the UTF-8 encoding, you just need to read up on UTF-8. The Wikipedia article on UTF-8 is a reasonable place to start.