
Calculating the length of a Japanese multibyte string with half-width kana in PHP

I have a UTF-8 encoded string which can contain full-width kanji, full-width kana, half-width kana, romaji, numbers or kawaii Japanese symbols like ★ or ♥.

If I want the length I use mb_strlen(), which counts each of these as 1. That is fine for most purposes.

But I've been asked (by a Japanese client) to count half-width kana as only 0.5 (for the purpose of the maxlength of a text field), because apparently that's how Japanese websites do it. I do this using mb_strwidth(), which counts full-width characters as 2 and half-width characters as 1, and then I just divide by 2.

However, this method also counts romaji characters as 1, so something like Chocｱｲｽ would count as 7. Dividing by 2 to account for kanji then gives 3.5, but I actually want 5.5 (4 for the romaji + 1.5 for the 3 half-width kana).
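To make the mismatch concrete, here is a minimal sketch of the two counts for the example string, with the kana written in half-width form. The variable names are only illustrative.

```php
<?php
// Sample strings: the kana in $half are half-width (U+FF71, U+FF72, U+FF7D).
$half = 'Chocｱｲｽ';
$fullKana = 'アイス';

$chars = mb_strlen($half, 'UTF-8');       // every character counts as 1
$width = mb_strwidth($half, 'UTF-8');     // romaji and half-width kana both count as 1
$fullWidth = mb_strwidth($fullKana, 'UTF-8'); // full-width kana count as 2 each

printf("mb_strlen: %d\n", $chars);            // 7
printf("mb_strwidth: %d\n", $width);          // 7
printf("mb_strwidth / 2: %.1f\n", $width / 2); // 3.5, not the 5.5 wanted here
printf("full-width kana width: %d\n", $fullWidth); // 6
```

The romaji are halved along with the kana, which is where the 3.5 comes from.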

// EDIT: some more info: any character (even non-kana) which has both a full-width and a half-width form should count as 1 for the full-width form and 0.5 for the half-width form. For example, characters like ￥、３＠（ should all be 1, but characters like ¥,3@( should all be 0.5.

// EXTRA EDIT: symbols like ☆ and ♥ should be 1, but the mb_strwidth/2 method returns them as 0.5.

Is there a standard way that Japanese systems count string length? Or does everyone just loop through their strings and count the characters which don't match the standard width rules?

icchanobot asked Apr 12 '11 09:04


1 Answer

One way is to convert the half-width katakana to full-width and subtract the difference in width from the original length:

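// The kana here are half-width; mb_strlen() and mb_strwidth() both give 7.
$raw = 'Chocｱｲｽ';
// 'K' converts half-width katakana to full-width katakana.
$full = mb_convert_kana($raw, 'K');
// Each converted kana widens by 1, so half the width difference is 0.5 per kana.
$len = mb_strlen($raw) - (mb_strwidth($full) - mb_strwidth($raw)) / 2;
assert($len === 5.5);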

However, are you sure that you should be considering Basic Latin characters as full-width? There are full-width varieties of the Basic Latin characters too. That is, should Choc be considered the same as Ｃｈｏｃ?

Usually, characters like "A" and "ｱ" would have a width of 1, but "Ａ" and "ア" would have a width of 2 (which is what mb_strwidth does). I'd be cautious about having to hack around that.
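A quick way to check this is to ask mb_strwidth for each of the four characters individually. This is only a small sketch; the array name is illustrative.

```php
<?php
// Half-width "A" and "ｱ" (U+FF71), then full-width "Ａ" (U+FF21) and "ア" (U+30A2).
$samples = ['A', 'ｱ', 'Ａ', 'ア'];

$widths = [];
foreach ($samples as $char) {
    // mb_strwidth reports 1 for half-width and 2 for full-width characters.
    $widths[] = mb_strwidth($char, 'UTF-8');
}

echo implode(',', $widths), "\n"; // 1,1,2,2
```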


Given your edit, mb_strwidth (or mb_strwidth/2) does exactly what you want.
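The one case the question reports as still wrong with mb_strwidth/2 is symbols such as ☆ and ♥, which come out as 0.5. If that matters, a rough alternative is to count every character as 1 and subtract 0.5 for each character in an explicit list of half-width ranges. The sketch below is only that: the function name field_length is hypothetical, and the ranges (printable ASCII, the half-width yen sign U+00A5, and the half-width punctuation and katakana block U+FF61 to U+FF9F) are an assumption about what should count as half-width, not a standard.

```php
<?php
/**
 * Hypothetical helper: length where half-width characters count as 0.5
 * and everything else counts as 1.
 */
function field_length(string $text): float
{
    // Assumed "half-width" set: printable ASCII, the yen sign (U+00A5),
    // and half-width CJK punctuation and katakana (U+FF61 to U+FF9F).
    $halfWidth = preg_match_all('/[\x{20}-\x{7E}\x{A5}\x{FF61}-\x{FF9F}]/u', $text);

    // Start from 1 per character, then take 0.5 off for each half-width one.
    return mb_strlen($text, 'UTF-8') - $halfWidth / 2;
}

var_dump(field_length('ｱｲｽ'));    // float(1.5)
var_dump(field_length('アイス')); // float(3)
var_dump(field_length('☆♥'));    // float(2)
var_dump(field_length('¥,3@('));  // float(2.5)
```

Anything not listed in the character class counts as 1, so symbols without a half-width form stay at 1, at the cost of maintaining the list of ranges by hand.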

一二三 answered Sep 24 '22 14:09