Given a UTF-8 string like this:
mystring = "işğüı"
is it possible to get its (in-memory) size in bytes with Python (2.5)?
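If you want the size of the string object in memory, in bytes, you can use the getsizeof() function from the sys module. Note that sys.getsizeof() was added in Python 2.6, so it is not available in Python 2.5, and the figure it returns includes the object's own overhead, not just the encoded bytes.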
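A minimal sketch of the difference between the two sizes, assuming Python 2.6 or later (sys.getsizeof() does not exist before that). The variable names are only illustrative, and the object size is implementation-specific, so no particular value is shown for it:

```python
import sys

# UTF-8 bytes of the example string, written with escapes so the snippet
# does not depend on the encoding of the source file.
encoded = u"i\u015f\u011f\xfc\u0131".encode("utf-8")

payload_size = len(encoded)            # number of encoded bytes
object_size = sys.getsizeof(encoded)   # whole Python object, overhead included

print("payload: %d bytes, object: %d bytes" % (payload_size, object_size))
```

The object size is always larger than the payload, because it also counts the header that the interpreter stores alongside the bytes.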
Each UTF (Unicode Transformation Format) can represent any Unicode character. UTF-8 is based on 8-bit code units: each character is encoded as 1 to 4 bytes, and the first 128 Unicode code points are encoded as 1 byte each.
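To make the 1-to-4-byte range concrete, here is a small sketch that encodes one sample character from each length class. The sample characters are my own picks, not from the question:

```python
# One sample character per UTF-8 length class, given as escapes so the
# snippet is independent of the source file encoding.
samples = [
    (u"A", "U+0041"),           # ASCII letter
    (u"\xfc", "U+00FC"),        # u with diaeresis
    (u"\u20ac", "U+20AC"),      # euro sign
    (u"\U0001F600", "U+1F600")  # a code point outside the Basic Multilingual Plane
]

# Number of bytes each character occupies once encoded as UTF-8.
byte_counts = [len(ch.encode("utf-8")) for ch, _ in samples]

for (ch, label), count in zip(samples, byte_counts):
    print("%s -> %d byte(s)" % (label, count))
```

This should report 1, 2, 3 and 4 bytes respectively.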
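So the size of a string is 18 + (2 * number of characters) bytes. (In reality, another 2 bytes are sometimes used for padding to ensure 32-bit alignment, but I'll ignore that.) 2 bytes are needed for each character, since each one is stored as a 16-bit code unit. Note that this formula describes strings held as 2-byte code units with a fixed overhead; it does not describe Python 2 byte strings, which hold 1 byte per encoded byte.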
UTF-8 uses one byte to represent code points from 0-127. These first 128 Unicode code points correspond one-to-one with ASCII character mappings, so ASCII characters are also valid UTF-8 characters.
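A quick way to check the ASCII compatibility claim, as a sketch with an arbitrary sample word:

```python
# Any text made only of code points 0-127 encodes to the same bytes
# in ASCII and in UTF-8.
ascii_text = u"hello"

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")

same_bytes = (as_ascii == as_utf8)
one_byte_each = (len(as_utf8) == len(ascii_text))

print("identical bytes: %s, one byte per character: %s" % (same_bytes, one_byte_each))
```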
Assuming you mean the number of UTF-8 bytes (and not the extra bytes that Python requires to store the object), you get it the same way as the length of any other string: with len(). An unprefixed string literal in Python 2.x is a string of encoded bytes, not Unicode characters, so len() counts bytes.
Byte strings:
>>> mystring = "işğüı"
>>> print "length of %r is %d" % (mystring, len(mystring))  # % formatting; str.format() needs Python 2.6+
length of 'i\xc5\x9f\xc4\x9f\xc3\xbc\xc4\xb1' is 9
Unicode strings:
>>> myunicode = u"işğüı"
>>> print "length of %r is %d" % (myunicode, len(myunicode))  # % formatting; str.format() needs Python 2.6+
length of u'i\u015f\u011f\xfc\u0131' is 5
It’s good practice to maintain all of your strings in Unicode, and only encode when communicating with the outside world. In this case, you could use len(myunicode.encode('utf-8'))
to find the size it would be after encoding.
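As a sketch of that approach, the helper below (encoded_size is a hypothetical name, not a standard function) returns the number of bytes a Unicode string would occupy in a given encoding. The string is written with escapes so the result does not depend on the source file encoding:

```python
def encoded_size(text, encoding="utf-8"):
    """Return how many bytes `text` occupies once encoded (hypothetical helper)."""
    return len(text.encode(encoding))

myunicode = u"i\u015f\u011f\xfc\u0131"  # the example string from the question

char_count = len(myunicode)                     # Unicode characters
utf8_size = encoded_size(myunicode)             # bytes as UTF-8
utf16_size = encoded_size(myunicode, "utf-16-le")  # bytes as UTF-16 without a BOM

print("%d characters, %d bytes in UTF-8, %d bytes in UTF-16"
      % (char_count, utf8_size, utf16_size))
```

The byte count depends on the encoding you choose, which is why it is worth deciding on the encoding before asking how big the string is.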