PHP has multibyte string functions for handling multibyte strings (e.g. CJK scripts). I want to count the characters in a multibyte string using Python's len function, but it returns an inaccurate result (i.e. the number of bytes in the string):
japanese = "桜の花びらたち"
print japanese
print len(japanese)  # returns 21 instead of 7
Is there any package or function like mb_strlen in PHP?
A null-terminated multibyte string (NTMBS), or "multibyte string", is a sequence of nonzero bytes followed by a byte with value zero (the terminating null character). Each character stored in the string may occupy more than one byte.
Multibyte Character Set (MBCS): A character set encoded with a variable number of bytes per character. Many large character sets have been defined as multibyte character sets in order to stay strictly compatible with the ASCII subset and with the ISO/IEC 2022 standard.
If supported by your input device, multibyte characters can be entered directly. Otherwise, you can enter any multibyte character in the ASCII form \[N], where N is the 2-, 4-, 6-, 7-, or 8-digit hexadecimal encoding for the character.
A Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence of code points needs to be represented in memory as a sequence of code units, and code units are then mapped to 8-bit bytes.
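The distinction between code points and bytes can be sketched as follows. This is a minimal illustration that assumes Python 3, where str is already a sequence of code points. The variable names are chosen for the example only.

```python
# Assumes Python 3: str holds code points, bytes holds encoded data.
text = "桜の花びらたち"

# Each character is one code point (an integer).
code_points = [ord(ch) for ch in text]

# Encoding maps the code points to bytes; the byte count depends on the encoding.
utf8_bytes = text.encode("utf-8")       # 3 bytes per character for this text
utf16_bytes = text.encode("utf-16-le")  # 2 bytes per character for this text

print(len(text))            # 7 code points
print(hex(code_points[0]))  # 0x685c, the code point of 桜
print(len(utf8_bytes))      # 21
print(len(utf16_bytes))     # 14
```

The same seven code points take 21 bytes in UTF-8 and 14 bytes in UTF-16, which is why counting bytes gives a result that depends on the encoding rather than on the text.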
Use Unicode strings:
# Encoding: UTF-8
japanese = u"桜の花びらたち"
print japanese
print len(japanese)
Note the u prefix in front of the string literal.
To convert a byte string into Unicode, use decode: "桜の花びらたち".decode('utf-8')
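For comparison, here is a rough sketch of the same idea under Python 3, where string literals are Unicode by default and only bytes objects need decoding. The helper name mb_strlen is hypothetical, borrowed from the PHP function in the question; it is not part of Python's standard library.

```python
def mb_strlen(data, encoding="utf-8"):
    """Hypothetical helper: count code points in a str or an encoded bytes object."""
    if isinstance(data, bytes):
        # Decode first so that len() counts characters, not bytes.
        data = data.decode(encoding)
    return len(data)


raw = "桜の花びらたち".encode("utf-8")  # a byte string, as read from a file or socket

print(len(raw))                     # 21, the number of bytes
print(mb_strlen(raw))               # 7, the number of code points
print(mb_strlen("桜の花びらたち"))  # 7
```

Note that len counts code points, so text containing combining marks may still report more than the number of visually distinct characters.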