I have a Korean string encoded as Unicode like u'정정'. How do I know how many bytes are needed to represent this string?
I need to know the exact byte count because I'm using the string in an iOS push notification, which has a limit on the size of the payload.
len('정정') doesn't work because that returns the number of characters, not the number of bytes.
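The byte count depends on the encoding, so you first need to decide which encoding you want to measure in. Note that the plain UTF-16 and UTF-32 codecs prepend a byte order mark, which is why they report more bytes than their -LE variants: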
>>> print u'\uC815\uC815'
정정
>>> print len(u'\uC815\uC815')
2
>>> print len(u'\uC815\uC815'.encode('UTF-8'))
6
>>> print len(u'\uC815\uC815'.encode('UTF-16-LE'))
4
>>> print len(u'\uC815\uC815'.encode('UTF-16'))
6
>>> print len(u'\uC815\uC815'.encode('UTF-32-LE'))
8
>>> print len(u'\uC815\uC815'.encode('UTF-32'))
12
You really want to review the Python Unicode HOWTO to fully appreciate the difference between a unicode object and its byte encoding.
Another excellent article is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky (one of the people behind Stack Overflow).