
How can I determine the byte length of a utf-8 encoded string in Python?

I am working with Amazon S3 uploads and am having trouble with key names being too long. S3 limits the length of the key by bytes, not characters.

From the docs:

The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.

I also embed metadata in the file name, so I need to calculate the byte length of the string in Python to make sure the metadata does not push the key over the limit (in which case I would have to use a separate metadata file).

How can I determine the byte length of the utf-8 encoded string? Again, I am not interested in the character length... rather the actual byte length used to store the string.

user319862 asked Jul 16 '11 02:07



2 Answers

def utf8len(s):
    return len(s.encode('utf-8'))

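Works in both Python 2 and 3, provided s is a text (unicode) string. In Python 2, a byte str containing non-ASCII data must be decoded first, or the implicit ASCII decode inside encode() will raise UnicodeDecodeError.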
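As a rough sketch of how this could be applied to the S3 case in the question: the 1024-byte limit comes from the docs quoted above, while key_fits, S3_MAX_KEY_BYTES, and the sample key and metadata strings are hypothetical names made up for illustration.

```python
# Limit quoted from the S3 docs in the question (bytes, not characters).
S3_MAX_KEY_BYTES = 1024


def utf8len(s):
    """Return the number of bytes in the UTF-8 encoding of text string s."""
    return len(s.encode('utf-8'))


def key_fits(key, limit=S3_MAX_KEY_BYTES):
    """Hypothetical helper: True if key encodes to at most `limit` bytes."""
    return utf8len(key) <= limit


# Made-up example values: a base key plus metadata embedded in the name.
base_key = u"reports/résumé.pdf"
metadata = u"__author=José"

candidate = base_key + metadata
if key_fits(candidate):
    key = candidate   # metadata fits inside the key name
else:
    key = base_key    # fall back to a separate metadata file
```

Checking the combined string, rather than adding character counts, matters because each accented character here takes two bytes in UTF-8.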

Dietrich Epp answered Nov 15 '22 10:11



Use the string 'encode' method to convert from a character-string to a byte-string, then use len() like normal:

>>> s = u"¡Hola, mundo!"
>>> len(s)
13 # characters
>>> len(s.encode('utf-8'))
14 # bytes
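
To see where the extra byte comes from, the sketch below encodes each character separately and lists the ones that need more than one byte. The variable names widths, multibyte, and total are only illustrative.

```python
s = u"¡Hola, mundo!"

# Pair each character with the size of its own UTF-8 encoding.
widths = [(ch, len(ch.encode('utf-8'))) for ch in s]

# Characters outside ASCII take two or more bytes in UTF-8.
multibyte = [(ch, n) for ch, n in widths if n > 1]

# Summing the per-character sizes gives the byte length of the whole string.
total = sum(n for _, n in widths)

print(multibyte)
print(total)
```

Here only the inverted exclamation mark is multi-byte, which accounts for the difference between the character count and the byte count.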
Mark Reed answered Nov 15 '22 09:11
