
Length of string that contains CJK characters

When given a string containing CJK characters, String.length returns the wrong number of characters in the string because it counts the number of bytes. For example:

# String.length "第1";;
- : int = 4

There are two characters in the string, but String.length returns 4 (which is the number of bytes in the string).

How can I get the real length of a string that contains CJK characters?

Flux asked Nov 22 '25 12:11


2 Answers

If you want to count the number of extended grapheme clusters (a.k.a. graphical characters), you can use Uuseg to do the segmentation:

(* Uuseg_string comes from the uuseg package *)
let len = Uuseg_string.fold_utf_8 `Grapheme_cluster (fun x _ -> x + 1) 0;;
len "春";;

1

This has the advantage of staying accurate in the presence of non-precomposed characters, such as decomposed jamo in Korean:

len "\u{1112}\u{1161}\u{11AB}";;

1

This is the correct result, since the string above should be displayed as the single character 한 even though it is written with 3 Unicode scalar values.
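To see those three scalar values directly, here is a small sketch that decodes the string using only the standard library. It assumes OCaml 4.14 or later (for String.get_utf_8_uchar), and the helper name scalar_values is made up for illustration.

```ocaml
(* Sketch only: list the Unicode scalar values of a UTF-8 string.
   Assumes OCaml >= 4.14; scalar_values is a hypothetical helper name. *)
let scalar_values (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      (* Invalid bytes decode to U+FFFD and still advance by at least one byte. *)
      go (i + Uchar.utf_decode_length d)
        (Uchar.to_int (Uchar.utf_decode_uchar d) :: acc)
  in
  go 0 []

let () =
  scalar_values "\u{1112}\u{1161}\u{11AB}"
  |> List.iter (Printf.printf "U+%04X ");
  print_newline ()
(* prints: U+1112 U+1161 U+11AB *)
```

Three scalar values, one grapheme cluster: that gap is exactly what the segmentation step accounts for.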

octachron answered Nov 25 '25 09:11


As stated in the comments, OCaml does not have native support for any particular encoding, hence the length being the number of bytes.
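A quick way to convince yourself of this is to dump the bytes that String.length is counting. The sketch below assumes the string is UTF-8 encoded (written here with a \u{...} escape so the source file encoding does not matter); bytes_hex is just an illustrative helper name.

```ocaml
(* Sketch only: show the raw bytes behind an OCaml string.
   bytes_hex is a hypothetical helper name. *)
let bytes_hex (s : string) : string =
  String.to_seq s
  |> Seq.map (fun c -> Printf.sprintf "%02x" (Char.code c))
  |> List.of_seq
  |> String.concat " "

let () =
  let s = "\u{7B2C}1" in (* the same string as "第1" *)
  Printf.printf "%d bytes: %s\n" (String.length s) (bytes_hex s)
(* prints: 4 bytes: e7 ac ac 31 *)
```

The first three bytes encode 第 and the last one is the ASCII digit, which is where the 4 comes from.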

Now, assuming you are using UTF-8 (as far as I know, the easiest way to mix ASCII and CJK), there are a few ways to calculate that length.

As an example, here is a version using the very lightweight Uutf library. [EDIT] As octachron pointed out, this returns the length in scalar values, not in characters; if you need the number of characters, use octachron's answer.

let utf8_length s = (* returns the number of Unicode scalar values *)
  let decoder = Uutf.decoder ~encoding:`UTF_8 (`String s) in
  (* Drain the decoder; malformed sequences are counted as well. *)
  let rec loop () =
    match Uutf.decode decoder with
    | `End -> ()
    | _ -> loop ()
  in
  loop ();
  Uutf.decoder_count decoder
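
If you would rather not add a dependency, a similar count can be sketched with the standard library alone. This assumes OCaml 4.14 or later (for String.get_utf_8_uchar), and utf8_scalar_count is a hypothetical name. Like the Uutf version, it counts scalar values, not grapheme clusters.

```ocaml
(* Sketch only: count Unicode scalar values in a UTF-8 string.
   Assumes OCaml >= 4.14; utf8_scalar_count is a hypothetical name. *)
let utf8_scalar_count (s : string) : int =
  let rec go i n =
    if i >= String.length s then n
    else
      (* Each decode, valid or not, consumes at least one byte,
         so the loop always terminates. *)
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

let () =
  Printf.printf "%d\n" (utf8_scalar_count "\u{7B2C}1");
  (* prints: 2 *)
  Printf.printf "%d\n" (utf8_scalar_count "\u{1112}\u{1161}\u{11AB}")
  (* prints: 3 *)
```

The second result is 3 rather than 1, which is the limitation mentioned in the edit above.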

PatJ answered Nov 25 '25 09:11