
How do I get the size of a UTF-8 string in bytes with Python?

Tags:

python

Given a UTF-8 string like this:

mystring = "işğüı"

Is it possible to get its (in-memory) size in bytes with Python (2.5)?

systempuntoout asked Oct 01 '10 19:10




1 Answer

Assuming you mean the number of UTF-8 bytes (and not the extra bytes that Python requires to store the object), it’s simply len() of the string, the same as for any other byte string. A string literal in Python 2.x is a string of encoded bytes, not Unicode characters, so len() counts bytes. The output below assumes the source or terminal encoding is UTF-8.

Byte strings:

>>> mystring = "işğüı"
>>> print "length of %r is %d" % (mystring, len(mystring))  # str.format() needs Python 2.6+
length of 'i\xc5\x9f\xc4\x9f\xc3\xbc\xc4\xb1' is 9

Unicode strings:

>>> myunicode = u"işğüı"
>>> print "length of %r is %d" % (myunicode, len(myunicode))  # str.format() needs Python 2.6+
length of u'i\u015f\u011f\xfc\u0131' is 5

It’s good practice to keep all of your strings as Unicode and encode only when communicating with the outside world. In that case, len(myunicode.encode('utf-8')) gives the size the string would have after encoding.
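For readers on Python 3, where str is always Unicode, the same idea can be sketched as below. This is only an illustration: the helper name utf8_size is made up for this example and is not part of any library, and the string is written with escapes so the result does not depend on the file's encoding.

```python
# Python 3 sketch: count characters vs. UTF-8 bytes for the same text.
# utf8_size is a hypothetical helper name used only for this example.

def utf8_size(text):
    """Return the number of bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))

# "işğüı" spelled with escapes: i, U+015F, U+011F, U+00FC, U+0131
mystring = "i\u015f\u011f\xfc\u0131"

char_count = len(mystring)        # number of Unicode code points
byte_count = utf8_size(mystring)  # number of bytes after UTF-8 encoding

print("characters:", char_count, "UTF-8 bytes:", byte_count)
```

The count covers only the encoded payload, not the extra memory Python uses to store the string object itself.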

Josh Lee answered Sep 18 '22 18:09
