
Lengths of Unicode strings are different

Tags:

php

unicode

How come the lengths of the following strings are different, although the number of characters in each string is the same?

echo strlen("馐 馑 馒 馓 馔 馕 首 馗 馘")."<BR>";
echo strlen("Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ")."<BR>";

Outputs

35
26
Imran Omar Bukhsh asked Sep 24 '11 06:09



2 Answers

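The first batch of characters take up three bytes each, because they sit far up the code point list at around 39 thousand (U+9990 onward), whereas the second group only take two bytes each, being around 400 (U+0190 onward). In UTF-8, code points from U+0080 to U+07FF need two bytes and those from U+0800 to U+FFFF need three. Each of the eight spaces takes one byte, so the totals are 9 × 3 + 8 = 35 and 9 × 2 + 8 = 26. (The number of bytes/octets required per character is discussed in the UTF-8 Wikipedia article.)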

strlen counts the number of bytes in the string, not the number of characters, which is why it gives such odd-looking results for multi-byte UTF-8 text.
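
A rough sketch of the byte arithmetic behind those two results, assuming the source file is saved as UTF-8. `utf8ByteLength` is a hypothetical helper written for this illustration, not a PHP built-in; it only encodes the standard UTF-8 code point ranges.

```php
<?php
// Hypothetical helper: bytes UTF-8 needs to encode one code point.
function utf8ByteLength(int $codePoint): int
{
    if ($codePoint <= 0x7F) {
        return 1; // ASCII, e.g. the space character
    }
    if ($codePoint <= 0x7FF) {
        return 2; // e.g. U+0190 (decimal 400)
    }
    if ($codePoint <= 0xFFFF) {
        return 3; // e.g. U+9990 (decimal 39312)
    }
    return 4; // U+10000 and above
}

// Assumption: this file is saved as UTF-8, so the literals hold UTF-8 bytes.
$cjk   = "馐 馑 馒 馓 馔 馕 首 馗 馘";
$latin = "Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ";

echo strlen($cjk), "\n";   // 9 characters * 3 bytes + 8 spaces
echo strlen($latin), "\n"; // 9 characters * 2 bytes + 8 spaces
```

Both strings hold nine non-space characters and eight spaces; only the bytes per character differ.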

Niet the Dark Absol answered Sep 29 '22 00:09



I am no PHP expert, but it seems that strlen counts bytes. There is also mb_strlen, which counts characters.
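
A small sketch of the difference, assuming the mbstring extension is loaded and the source file is saved as UTF-8. The encoding is passed explicitly here so the result does not depend on the configured default.

```php
<?php
// Assumptions: mbstring is available and this file is saved as UTF-8.
$cjk   = "馐 馑 馒 馓 馔 馕 首 馗 馘";
$latin = "Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ";

// strlen counts bytes; mb_strlen counts characters in the given encoding.
echo strlen($cjk), " bytes, ", mb_strlen($cjk, 'UTF-8'), " characters\n";
echo strlen($latin), " bytes, ", mb_strlen($latin, 'UTF-8'), " characters\n";
```

Note that the character count includes the eight spaces in each string, so both strings come out the same length under mb_strlen.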

EDIT - for further reference on how multi-byte encoding works, see http://en.wikipedia.org/wiki/Variable-width_encoding, and for UTF-8 in particular see http://en.wikipedia.org/wiki/UTF-8

Yahia answered Sep 28 '22 23:09
