When given a string containing CJK characters, String.length does not return the number of characters in the string, because it counts bytes instead. For example:
# String.length "第1";;
- : int = 4
There are two characters in the string, but String.length returns 4 (which is the number of bytes in the string).
How can I get the real length of a string that contains CJK characters?
If you want to count the number of extended grapheme clusters (i.e. user-perceived characters), you can use Uuseg to do the segmentation:
let len = Uuseg_string.fold_utf_8 `Grapheme_cluster (fun x _ -> x + 1) 0
;; len "春"
1
which has the advantage of still being accurate in the presence of non-precomposed characters, like decomposed jamo in Korean:
;; len "\u{1112}\u{1161}\u{11AB}"
1
which is the correct result, since the previous string is displayed as 한 even though it is written with 3 Unicode scalar values.
As stated in the comments, OCaml does not have native support for any particular encoding, hence the length being the number of bytes.
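To make the byte counting concrete, here is a small sketch that uses only the standard library to list the bytes of the string from the question. The helper name `bytes_of_string` is made up for this illustration, and the string is written with a `\u{...}` escape (U+7B2C is 第) so that the encoding of the source file does not matter.

```ocaml
(* Hypothetical helper: the bytes of a string, in order, as integers. *)
let bytes_of_string (s : string) : int list =
  List.init (String.length s) (fun i -> Char.code s.[i])

let () =
  (* U+7B2C followed by the ASCII digit '1', i.e. the string "第1". *)
  let s = "\u{7B2C}1" in
  (* Print each byte in hexadecimal. *)
  List.iter (fun b -> Printf.printf "%02X " b) (bytes_of_string s);
  print_newline ()
```

The first character takes three bytes in UTF-8 and the digit takes one, which is where the 4 reported by String.length comes from.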
Now, assuming you are using UTF-8 encoding (which is the easiest way to mix ASCII and CJK, AFAIK), there are a few ways to calculate that length.
As an example, here is a version using the very lightweight Uutf library. [EDIT] As octachron pointed out, this returns the length in scalar values and not in characters; you should use octachron's answer.
let utf8_length s = (* returns the number of Unicode scalar values *)
  let decoder = Uutf.decoder ~encoding:`UTF_8 (`String s) in
  let rec loop () = match Uutf.decode decoder with | `End -> () | _ -> loop () in
  loop ();
  Uutf.decoder_count decoder
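If pulling in a library is not an option, the same scalar-value count can be sketched with the standard library alone, assuming OCaml 4.14 or later (where `String.get_utf_8_uchar` and `Uchar.utf_decode_length` were added). The name `utf8_scalar_count` is made up for this illustration, and this is only one possible way to write it.

```ocaml
(* Counts Unicode scalar values in a UTF-8 encoded string using only the
   standard library (assumes OCaml >= 4.14).
   Each malformed sequence is counted as one item, because the decoder
   reports it as a single invalid decode with a positive length. *)
let utf8_scalar_count (s : string) : int =
  let len = String.length s in
  let rec loop i n =
    if i >= len then n
    else
      let d = String.get_utf_8_uchar s i in
      (* utf_decode_length is always at least 1, so the loop terminates. *)
      loop (i + Uchar.utf_decode_length d) (n + 1)
  in
  loop 0 0
```

The same caveat applies as for the Uutf version: on the decomposed jamo string "\u{1112}\u{1161}\u{11AB}" this returns 3, not 1, because it counts scalar values and not grapheme clusters.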