I tried this in Python to get the length of a string in bytes:
>>> s = 'a'
>>> s.encode('utf-8')
b'a'
>>> s.encode('utf-16')
b'\xff\xfea\x00'
>>> s.encode('utf-32')
b'\xff\xfe\x00\x00a\x00\x00\x00'
>>> len(s.encode('utf-8'))
1
>>> len(s.encode('utf-16'))
4
>>> len(s.encode('utf-32'))
8
UTF-8 uses one byte to store an ASCII character, as expected, but why does UTF-16 use 4 bytes? What exactly is len() measuring?
TL;DR:
UTF-8 : 1 byte 'a'
UTF-16: 2 bytes 'a' + 2 bytes BOM
UTF-32: 4 bytes 'a' + 4 bytes BOM
UTF-8 is a variable-length encoding: characters are encoded with 1 to 4 bytes each. It was designed to match ASCII for the first 128 characters, so 'a' is a single byte wide.
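For instance (the specific characters here are just arbitrary examples of 1-, 2-, 3- and 4-byte UTF-8 sequences):
>>> for ch in 'a', 'é', '€', '🐍':
...     print(ch, len(ch.encode('utf-8')))
...
a 1
é 2
€ 3
🐍 4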
UTF-16 is a variable-length encoding; code points are encoded with one or two 16-bit code units (i.e. 2 or 4 bytes), so 'a' is 2 bytes wide.
UTF-32 is a fixed-width encoding: every code point takes exactly 32 bits, so 'a' is 4 bytes wide.
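The variable width of UTF-16 only shows up outside the Basic Multilingual Plane. An astral code point, such as the snake emoji used here purely as an arbitrary example, needs a surrogate pair:
>>> len('🐍'.encode('utf-16'))  # 2-byte BOM + one surrogate pair (4 bytes)
6
>>> len('🐍'.encode('utf-32'))  # 4-byte BOM + one 4-byte code point
8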
For the lengths of an 'a' encoded in UTF-8, UTF-16 and UTF-32, you might expect results of 1, 2 and 4 respectively. The actual results of 1, 4 and 8 are inflated because in the last two cases the output includes the BOM: that \xff\xfe thing is the byte order mark, used to indicate the endianness of the data.
The Unicode standard permits the BOM in UTF-8, but neither requires nor recommends its use (it has no meaning there), which is why you don't see any BOM in the first example. The UTF-16 BOM is 2 bytes wide and the UTF-32 BOM is 4 bytes wide (actually it's just the same as a UTF-16 BOM, plus some padding nulls).
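You can see the BOM, and its width in each encoding, on its own by encoding an empty string (at least with CPython's codecs):
>>> ''.encode('utf-8'), ''.encode('utf-16'), ''.encode('utf-32')
(b'', b'\xff\xfe', b'\xff\xfe\x00\x00')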
>>> 'a'.encode('utf-16') # length 4: 2 bytes BOM + 2 bytes a
b'\xff\xfea\x00'
BOM.....a....
>>> 'aaa'.encode('utf-16') # length 8: 2 bytes BOM + 3*2 bytes of a
b'\xff\xfea\x00a\x00a\x00'
BOM.....a....a....a....
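If you don't want the BOM in the output, pick an explicit byte order: the endian-specific codecs don't emit one ('utf-32-le' and 'utf-32-be' behave the same way):
>>> 'a'.encode('utf-16-le')   # explicit little-endian, no BOM
b'a\x00'
>>> 'a'.encode('utf-16-be')   # explicit big-endian, no BOM
b'\x00a'
>>> len('a'.encode('utf-16-le'))
2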
Seeing the BOM in the data might be clearer if you look at the raw bits using the bitstring module:
>>> # pip install bitstring
>>> from bitstring import Bits
>>> Bits(bytes='a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> Bits(bytes='aaa'.encode('utf-32')).bin
'11111111111111100000000000000000011000010000000000000000000000000110000100000000000000000000000001100001000000000000000000000000'
BOM.............................a...............................a...............................a...............................
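If you'd rather not install anything, roughly the same view is available from the standard library by formatting each byte as binary (a quick equivalent sketch):
>>> ''.join(f'{byte:08b}' for byte in 'a'.encode('utf-32'))
'1111111111111110000000000000000001100001000000000000000000000000'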
The reason your lengths look weird is that the UTF-16 and UTF-32 codecs prepend a byte order mark to the encoded output. That's why the lengths come out at double what you'd expect: the result contains two code points, the BOM (U+FEFF) plus your 'a'. The byte order mark tells a reader of the data a few things, endianness and encoding being the main ones. So len() is functioning exactly as you'd expect: it's measuring the number of bytes used in the encoded representation.
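To make that concrete, len() on a str counts code points, while len() on the result of .encode() counts bytes, and the decoder consumes the BOM on the way back:
>>> encoded = 'a'.encode('utf-16')
>>> len(encoded)            # bytes: 2-byte BOM + 2 bytes for 'a'
4
>>> decoded = encoded.decode('utf-16')
>>> decoded, len(decoded)   # the BOM is consumed; len() on a str counts code points
('a', 1)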