How to handle multibyte string in Python

PHP has multibyte string functions for handling multibyte strings (e.g. CJK scripts). For example, I want to count the characters in a multibyte string using the len function in Python, but it returns an inaccurate result (i.e. the number of bytes in the string):

japanese = "桜の花びらたち"
print japanese
print len(japanese)  # returns 21 instead of 7

Is there any package or function like mb_strlen in PHP?

hungneox asked Dec 01 '11 18:12



1 Answer

Use Unicode strings:

# Encoding: UTF-8

japanese = u"桜の花びらたち"
print japanese
print len(japanese)

Note the u in front of the string: it makes the literal a unicode object rather than a byte string, so len returns 7.

To convert a bytestring into Unicode, use decode: "桜の花びらたち".decode('utf-8')
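
The difference between counting bytes and counting characters can also be sketched in Python 3. This is an assumption beyond the original answer, which targets Python 2: in Python 3, plain string literals are already Unicode, so no u prefix is needed, and encode produces the byte string that Python 2 stored by default. The variable name raw is only illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3 (an assumption; the answer above targets Python 2).
# In Python 3, str literals are Unicode text by default.
japanese = "桜の花びらたち"

# Encoding to UTF-8 yields the underlying byte string.
raw = japanese.encode("utf-8")

print(len(japanese))  # 7: number of characters (code points)
print(len(raw))       # 21: number of bytes, 3 per character here

# decode() turns the bytes back into text, as described above.
assert raw.decode("utf-8") == japanese
```

Each of these characters takes three bytes in UTF-8, which is why the byte count is three times the character count.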

Petr Viktorin answered Sep 18 '22 20:09