
Get the number of bytes needed for a Unicode string

I have a Korean string encoded as Unicode like u'정정'. How do I know how many bytes are needed to represent this string?

I need to know the exact byte count since I'm using the string for iOS push notification and it has a limit on the size of the payload.

len('정정') doesn't work because that returns the number of characters, not the number of bytes.

jasondinh asked Aug 06 '12 17:08


1 Answer

The byte count depends on the encoding, so first decide which encoding you want to measure the size in:

>>> print u'\uC815\uC815'
정정
>>> print len(u'\uC815\uC815')
2
>>> print len(u'\uC815\uC815'.encode('UTF-8'))
6
>>> print len(u'\uC815\uC815'.encode('UTF-16-LE'))
4
>>> print len(u'\uC815\uC815'.encode('UTF-16'))
6
>>> print len(u'\uC815\uC815'.encode('UTF-32-LE'))
8
>>> print len(u'\uC815\uC815'.encode('UTF-32'))
12
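
The plain 'UTF-16' and 'UTF-32' results are larger than their '-LE' counterparts because, without an explicit byte order, Python prepends a byte order mark (2 bytes for UTF-16, 4 bytes for UTF-32). The session above uses Python 2 syntax. As a rough Python 3 sketch of the same measurement (the helper name `encoded_size` is made up for illustration, not part of any library):

```python
# Python 3: str is already Unicode, so no u'' prefix is needed.
def encoded_size(text, encoding="utf-8"):
    """Return the number of bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))

text = "\uC815\uC815"  # the two-character Korean string from the question

# Measure the same string under each encoding used in the session above.
sizes = {
    enc: encoded_size(text, enc)
    for enc in ("utf-8", "utf-16-le", "utf-16", "utf-32-le", "utf-32")
}

for enc, size in sizes.items():
    print(enc, size)
```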

You really want to review the Python Unicode HOWTO to fully appreciate the difference between a unicode object and its byte encoding.

Another excellent article is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky (one of the people behind Stack Overflow).
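
For the push notification case in the question, what counts is probably the size of the serialized payload rather than the string alone. Here is a minimal Python 3 sketch, assuming the payload is sent as UTF-8 encoded JSON. The helper `payload_size` and the constant `MAX_PAYLOAD_BYTES` are hypothetical names, and the limit value is only a placeholder, so check the push service's documentation for the real figure:

```python
import json

# Placeholder limit for illustration only; this is an assumption, not a documented value.
MAX_PAYLOAD_BYTES = 256

def payload_size(payload):
    """Return the UTF-8 byte length of `payload` serialized as compact JSON."""
    # ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX
    # escapes; separators=(",", ":") drops the optional whitespace.
    body = json.dumps(payload, ensure_ascii=False, separators=(",", ":"))
    return len(body.encode("utf-8"))

# Illustrative payload carrying the string from the question.
payload = {"aps": {"alert": "\uC815\uC815"}}

size = payload_size(payload)
fits = size <= MAX_PAYLOAD_BYTES
print(size, fits)
```

Measuring the serialized form matters because the JSON punctuation and keys count toward the limit too, and because escaping each Korean character as \uXXXX (the `json.dumps` default) would take 6 bytes per character instead of 3.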

Martijn Pieters answered Sep 28 '22 15:09