How do I use Camomile for UTF8 strings in ocaml?

Tags: ocaml, camomile

I have downloaded and installed Camomile, so I am ready to use it.

The question is: how should I use it?

In OCaml, for a default string, I just write let s = "a string";;

But what about with Camomile?

For example, if I want to construct the UTF-8 string こんにちは (a Japanese word for hello, copied from Google Translate), how should I do it with Camomile?


Edit:

It is funny that people say OCaml can't support UTF-8, because I tried this code:

let s = "你好";;

let _ = print_string s;print_string "\n";;

It worked in OCaml. But why? 你好 is Chinese, so how can OCaml print and handle it if everyone says OCaml 4.00.1 cannot handle UTF-8?

Jackson Tale asked Apr 24 '13 16:04


2 Answers

Here is a short presentation of the different actors:

  • ASCII is both a set of characters (there are 128 of them) and a code to represent them (on 7 bits).

  • Unicode is a set of characters (there are a lot more than 128).

  • UTF-8 is an encoding that represents Unicode characters as sequences of one to four bytes.

  • Your terminal. It interprets the bytes output by your program as UTF-8 encoded characters and displays the corresponding Unicode characters.

  • OCaml processes sequences of bytes (OCaml uses the name char, but it is misleading; the name byte would be more appropriate).

So if OCaml outputs the sequence of bytes corresponding to the UTF-8 encoding of "你好", your terminal will interpret it as a UTF-8 string and will display 你好. But for OCaml, "你好" is just a sequence of 6 bytes (3 bytes per character).
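
A minimal sketch of the point above, using only the standard library and assuming the source file is saved as UTF-8. The name s_bytes is introduced here for illustration; it spells out the same six bytes with decimal escapes so the comparison does not depend on the editor's encoding.

```ocaml
(* Assumption: this source file is saved as UTF-8, so the literal below
   is stored as the six bytes E4 BD A0 E5 A5 BD. *)
let s = "你好"

(* The same six bytes written explicitly with decimal escapes. *)
let s_bytes = "\228\189\160\229\165\189"

let () =
  (* String.length counts bytes, not characters. *)
  Printf.printf "length in bytes: %d\n" (String.length s);
  (* Print each byte in hexadecimal to show what OCaml actually stores. *)
  String.iter (fun c -> Printf.printf "%02X " (Char.code c)) s;
  print_newline ();
  Printf.printf "same bytes as the escaped form: %b\n" (s = s_bytes)
```

print_string simply hands these bytes to the terminal unchanged, which is why the original snippet displays correctly on a UTF-8 terminal.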

Thomash answered Sep 22 '22 18:09


TörökEdwin told you everything you need to know, I think. UTF-8 is specifically designed as a way to store Unicode values (code points) in a series of 8-bit bytes, so that it works with code that is used to dealing with ASCII C strings. Since OCaml strings are a series of 8-bit bytes, there's no problem storing a UTF-8 value there. If the program you use to create your OCaml source handles UTF-8, then it will have no trouble creating a string containing a UTF-8 value. You don't need to do anything special to get that to happen. (As I said, I've done this many times myself.)

If you don't need to process the value, then the OCaml I/O functions can also write out such a value (or read one in), and if the encoding of your display is UTF-8 (which is what I use), it will display correctly. But most often you will need to process your values. If you change your code to (for example) just write out the length of the string, you might start to see why you would need a special library for handling UTF-8.
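
To illustrate why processing needs more than the built-in string functions, here is a rough sketch that compares the byte length with a hand-rolled code point count. The helper utf8_length is hypothetical (it is not part of Camomile or the standard library) and assumes its input is valid UTF-8; a real library such as Camomile also handles validation, indexing, and much more.

```ocaml
(* Hypothetical helper: counts code points in a string assumed to be
   valid UTF-8. Continuation bytes have the bit pattern 10xxxxxx, so we
   count every byte that is NOT a continuation byte. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter
    (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n)
    s;
  !n

let () =
  (* Assumption: this source file is saved as UTF-8. *)
  let s = "你好" in
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length s) (utf8_length s)
```

String.length reports 6 for this string while the helper reports 2, and that gap is exactly what a Unicode-aware library is meant to manage for you.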

If you wonder why a certain Unicode string is represented as a certain series of bytes in the UTF-8 encoding you just need to read up on UTF-8. The Wikipedia article (UTF-8) might be a reasonable place to start.

Jeffrey Scofield answered Sep 24 '22 18:09