 

Length of string in Python 3.5 with different encodings

I tried this in Python to get the length of a string in bytes.

>>> s = 'a'
>>> s.encode('utf-8')
b'a'
>>> s.encode('utf-16')
b'\xff\xfea\x00'
>>> s.encode('utf-32')
b'\xff\xfe\x00\x00a\x00\x00\x00'
>>> len(s.encode('utf-8'))
1
>>> len(s.encode('utf-16'))
4
>>> len(s.encode('utf-32'))
8

UTF-8 uses one byte to store an ASCII character, as expected, but why does UTF-16 use 4 bytes? What is len() measuring exactly?

asked Mar 08 '23 by Z-Jiang

2 Answers

TL;DR:

UTF-8 : 1 byte 'a'
UTF-16: 2 bytes 'a' + 2 bytes BOM
UTF-32: 4 bytes 'a' + 4 bytes BOM
  • UTF-8 is a variable-length encoding: characters are encoded with 1 to 4 bytes. It was designed to match ASCII for the first 128 characters, so an 'a' is a single byte wide.

  • UTF-16 is also a variable-length encoding: code points are encoded with one or two 16-bit code units (i.e. 2 or 4 bytes), so an 'a' is 2 bytes wide.

  • UTF-32 is a fixed-width encoding: every code point is exactly 32 bits, so an 'a', like every other character, is 4 bytes wide. (A quick sketch of these widths follows this list.)
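
One way to see those per-character widths without a BOM getting in the way is to use the endian-specific codecs ('utf-16-le', 'utf-32-le' and their '-be' variants), which never emit a BOM. A quick sketch (the emoji is just an arbitrary character outside ASCII and the BMP):

>>> # Endian-specific codecs emit no BOM, so len() shows only the
>>> # character's own width in each encoding.
>>> len('a'.encode('utf-8')), len('a'.encode('utf-16-le')), len('a'.encode('utf-32-le'))
(1, 2, 4)
>>> len('€'.encode('utf-8'))       # variable length: 3 bytes in UTF-8
3
>>> len('🐍'.encode('utf-16-le'))  # surrogate pair: two 16-bit units
4
>>> len('🐍'.encode('utf-32-le'))  # still exactly one 32-bit unit
4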

For the length of an "a" encoded in UTF-8, UTF-16, and UTF-32, you might expect to see results of 1, 2, and 4 respectively. The actual results of 1, 4, and 8 are inflated because in the last two cases the output includes the BOM - that \xff\xfe at the start is the byte order mark, used to indicate the endianness of the data.

The Unicode standard permits a BOM in UTF-8 but neither requires nor recommends its use (it has no meaning there, since UTF-8's code unit is a single byte), which is why you don't see any BOM in the first example. The UTF-16 BOM is 2 bytes wide and the UTF-32 BOM is 4 bytes wide (it's actually the same as the UTF-16 BOM, plus two padding null bytes).

>>> 'a'.encode('utf-16')  # length 4: 2 bytes BOM + 2 bytes a
b'\xff\xfea\x00'
  BOM.....a....
>>> 'aaa'.encode('utf-16')  # length 8: 2 bytes BOM + 3*2 bytes of a
b'\xff\xfea\x00a\x00a\x00'
  BOM.....a....a....a....
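
For reference, the standard-library codecs module exposes these BOM byte sequences as constants, so you can compare them against the prefixes above (the little-endian constants are shown because that's what the question's little-endian machine produced):

>>> import codecs
>>> codecs.BOM_UTF16_LE            # the 2-byte UTF-16 BOM
b'\xff\xfe'
>>> codecs.BOM_UTF32_LE            # the same bytes plus two padding nulls
b'\xff\xfe\x00\x00'
>>> 'a'.encode('utf-16').startswith(codecs.BOM_UTF16)   # native-order BOM
True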

Seeing the BOM in the data might be clearer if you look at raw bits using the bitstring module:

>>> # pip install bitstring
>>> from bitstring import Bits
>>> Bits(bytes='a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> Bits(bytes='aaa'.encode('utf-32')).bin
'11111111111111100000000000000000011000010000000000000000000000000110000100000000000000000000000001100001000000000000000000000000'
 BOM.............................a...............................a...............................a...............................
answered Mar 23 '23 by wim


The reason your lengths look weird is that the UTF-16 and UTF-32 encoders add a byte order mark at the beginning of the output. That's why, for a one-character string, the lengths are double what you'd expect: the encoded output contains two code points, your character plus the BOM (U+FEFF). The byte order mark tells a decoder a few things, endianness and (implicitly) the encoding being the main ones. So len is functioning exactly as you'd expect: it's measuring the number of bytes used in the encoded representation.
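
A short sketch of that distinction, reusing the question's string (encoding an empty string is just a quick way to isolate the BOM):

>>> s = 'a'
>>> len(s)                    # len() on a str counts code points
1
>>> len(s.encode('utf-16'))   # len() on bytes counts bytes: 2-byte BOM + 2-byte 'a'
4
>>> len(''.encode('utf-16'))  # the BOM alone
2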

answered Mar 23 '23 by Saedeas