Python 3: Demystifying encode and decode methods

Let's say I have a string in Python:

>>> s = 'python'
>>> len(s)
6

Now I encode this string like this:

>>> b = s.encode('utf-8')
>>> b16 = s.encode('utf-16')
>>> b32 = s.encode('utf-32')

What I get from the above operations is a bytes object -- that is, b, b16 and b32 are just immutable sequences of bytes (each byte being 8 bits long, of course).

But we encoded the string. So, what does this mean? How do we attach the notion of "encoding" to the raw array of bytes?

The answer lies in the fact that each of these arrays of bytes is generated in a particular way. Let's look at them:

>>> [hex(x) for x in b]
['0x70', '0x79', '0x74', '0x68', '0x6f', '0x6e']

>>> len(b)
6

This array indicates that we get one byte per character (because all the characters fall in the ASCII range, below 128). Hence, we can say that "encoding" the string to 'utf-8' collects each character's corresponding code point and puts it into the array. If the code point cannot fit in one byte, then UTF-8 uses more bytes for it (up to four). Hence UTF-8 consumes the least number of bytes possible.
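For instance (a quick check with a few characters that are not in the example string), a code point beyond the ASCII range takes two, three, or four bytes in UTF-8:

>>> 'é'.encode('utf-8')      # U+00E9: 2 bytes
b'\xc3\xa9'
>>> '€'.encode('utf-8')      # U+20AC: 3 bytes
b'\xe2\x82\xac'
>>> '🐍'.encode('utf-8')     # U+1F40D: 4 bytes
b'\xf0\x9f\x90\x8d'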

>>> [hex(x) for x in b16]
['0xff', '0xfe', '0x70', '0x0', '0x79', '0x0', '0x74', '0x0', '0x68', '0x0', '0x6f', '0x0', '0x6e', '0x0']

>>> len(b16)
14     # (2 + 6*2)

Here we can see that encoding to 'utf-16' first puts a two-byte BOM (FF FE) into the bytes array, and after that puts two bytes into the array for each character. (In our case, the second byte is always zero because all our characters fall in the ASCII range.)
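Naming the byte order explicitly skips the BOM, which makes the two-bytes-per-character pattern easier to see (a quick check, not part of the original session):

>>> s.encode('utf-16-le')
b'p\x00y\x00t\x00h\x00o\x00n\x00'
>>> len(s.encode('utf-16-le'))
12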

>>> [hex(x) for x in b32]
['0xff', '0xfe', '0x0', '0x0', '0x70', '0x0', '0x0', '0x0', '0x79', '0x0', '0x0', '0x0', '0x74', '0x0', '0x0', '0x0', '0x68', '0x0', '0x0', '0x0', '0x6f', '0x0', '0x0', '0x0', '0x6e', '0x0', '0x0', '0x0']

>>> len(b32)
28     # (2 + 6*4 + 2)

In the case of "encoding in utf-32", we first put the BOM, then for each character we put four bytes, and lastly we put two zero bytes into the array.

Hence, we can say that the "encoding process" produces 1, 2, or 4 bytes (depending on the encoding name) for each character in the string, and prepends and appends more bytes to them to create the final array of bytes.
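As a sanity check (continuing the session above), decoding each result with the matching codec gives the original string back:

>>> b.decode('utf-8') == b16.decode('utf-16') == b32.decode('utf-32') == s
True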

Now, my questions:

  • Is my understanding of the encoding process correct or am I missing something?
  • We can see that the memory representation of the variables b, b16 and b32 is actually a list of bytes. What is the memory representation of the string? Exactly what is stored in memory for a string?
  • We know that when we do an encode(), each character's corresponding code point is collected (the code point corresponding to the encoding name) and put into an array of bytes. What exactly happens when we do a decode()?
  • We can see that in utf-16 and utf-32, a BOM is prepended, but why are two zero bytes appended in the utf-32 encoding?
asked Nov 20 '12 by treecoder



2 Answers

First of all, UTF-32 is a 4-byte encoding, so its BOM is a four-byte sequence too:

>>> import codecs
>>> codecs.BOM_UTF32
b'\xff\xfe\x00\x00'

And because different computer architectures treat byte orders differently (called endianness), there are two variants of the BOM, little-endian and big-endian:

>>> codecs.BOM_UTF32_LE
b'\xff\xfe\x00\x00'
>>> codecs.BOM_UTF32_BE
b'\x00\x00\xfe\xff'

The purpose of the BOM is to communicate that order to the decoder: read the BOM and you know whether the data is big- or little-endian. So, those last two null bytes in your UTF-32 output are part of the last encoded character, not a separate suffix.
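You can watch the BOM being consumed by decoding with and without an explicit byte order (assuming a little-endian machine, which matches the output in the question):

>>> b32.decode('utf-32')       # byte order read from the BOM, BOM removed
'python'
>>> b32.decode('utf-32-le')    # byte order forced, so the BOM survives as U+FEFF
'\ufeffpython'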

The UTF-16 BOM works the same way; there are two variants:

>>> codecs.BOM_UTF16
b'\xff\xfe'
>>> codecs.BOM_UTF16_LE
b'\xff\xfe'
>>> codecs.BOM_UTF16_BE
b'\xfe\xff'

Which one is used by default depends on your computer architecture.
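You can check your machine's native byte order with sys.byteorder (the output below is from a little-endian x86 machine; yours may differ):

>>> import sys
>>> sys.byteorder
'little'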

UTF-8 doesn't need a BOM at all; UTF-8 uses 1 or more bytes per character (adding bytes as needed to encode more complex values), but the order of those bytes is defined in the standard. Microsoft deemed it necessary to introduce a UTF-8 BOM anyway (so its Notepad application could detect UTF-8), but since the byte order of UTF-8 never varies, its use is discouraged.
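Python exposes that Microsoft-style BOM through the 'utf-8-sig' codec, which writes the marker on encode and strips it on decode:

>>> codecs.BOM_UTF8
b'\xef\xbb\xbf'
>>> 'python'.encode('utf-8-sig')
b'\xef\xbb\xbfpython'
>>> b'\xef\xbb\xbfpython'.decode('utf-8-sig')
'python'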

As for what Python stores for unicode strings: that actually changed in Python 3.3. Before 3.3, at the C level, Python stored either UTF-16 or UTF-32 code units internally, depending on whether Python was compiled with wide character support (see How to find out if Python is compiled with UCS-2 or UCS-4?; UCS-2 is essentially UTF-16, and UCS-4 is UTF-32). So each character takes either 2 or 4 bytes of memory.
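You can tell the two builds apart with sys.maxunicode: a narrow (UCS-2) build reports 65535, while a wide (UCS-4) build, and every Python from 3.3 on, reports the full Unicode range:

>>> import sys
>>> sys.maxunicode
1114111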

As of Python 3.3, the internal representation uses the minimal number of bytes required to represent all characters in the string. Plain ASCII and Latin-1-encodable text uses 1 byte per character, the rest of the BMP uses 2 bytes, and text containing characters beyond the BMP uses 4 bytes. Python switches between the formats as needed, so storage has become a lot more efficient for most cases. For more detail, see What's New in Python 3.3.
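A rough way to observe this (the exact object-header sizes vary between CPython versions, but on CPython 3.3+ the per-character increments should hold):

>>> import sys
>>> sys.getsizeof('aaaa') - sys.getsizeof('aaa')        # Latin-1 range: 1 byte per character
1
>>> sys.getsizeof('€€€€') - sys.getsizeof('€€€')        # rest of the BMP: 2 bytes
2
>>> sys.getsizeof('🐍🐍🐍🐍') - sys.getsizeof('🐍🐍🐍')  # beyond the BMP: 4 bytes
4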

I can strongly recommend you read up on Unicode and Python with:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  • The Python Unicode HOWTO
answered Oct 07 '22 by Martijn Pieters


  1. Your understanding is essentially correct as far as it goes, although it's not really "1, 2, or 4 bytes". For UTF-32 it is always 4 bytes. For UTF-16 and UTF-8 the number of bytes depends on the character being encoded: UTF-16 uses either 2 or 4 bytes, and UTF-8 uses 1, 2, 3, or 4 bytes. But yes, basically encoding takes the Unicode code point and maps it to a sequence of bytes; how this mapping is done depends on the encoding. For UTF-32 it is just the code point number written directly as a 4-byte integer. For UTF-16 it is usually that (in 2 bytes), but differs for characters outside the Basic Multilingual Plane, which are encoded as surrogate pairs. For UTF-8 the encoding is more complex (see Wikipedia). As for the extra bytes at the beginning, those are byte-order marks that tell the decoder which order the bytes within each code unit come in for UTF-16 or UTF-32.
  2. I guess you could look at the internals, but the point of the string type (or unicode type in Python 2) is to shield you from that information, just like the point of a Python list is to shield you from having to manipulate the raw memory structure of that list. The string data type exists so you can work with unicode code points without worrying about the memory representation. If you want to work with the raw bytes, encode the string.
  3. When you do a decode(), the decoder basically scans the bytes, looking for chunks that form characters. The encoding schemes provide "clues" that let the decoder see where one character ends and the next begins, so it scans along, uses these clues to find the boundaries between characters, and then looks up each piece to see which character it represents in that encoding. You can look up the individual encodings on Wikipedia or the like if you want the details of how each encoding maps code points back and forth to bytes. (A small illustration of these clues follows after this list.)
  4. The two zero bytes are part of the byte-order marker for UTF-32. Because UTF-32 always uses 4 bytes per code point, the BOM is four bytes as well. Basically the FFFE marker that you see in UTF-16 is zero-padded with two extra zero bytes. These byte order markers indicate whether the numbers making up the code point are in order from largest to smallest or smallest to largest. Basically it's like the choice of whether to write the number "one thousand two hundred and thirty four" as 1234 or 4321. Different computer architectures make different choices on this matter.
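To illustrate point 3 with a quick check (not part of the original answer): a multi-byte UTF-8 sequence announces its own length in its first byte, so the decoder notices when bytes are missing:

>>> b'\xe2\x82\xac'.decode('utf-8')    # complete 3-byte sequence for '€'
'€'
>>> b'\xe2\x82'.decode('utf-8')        # truncated sequence
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data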
answered Oct 07 '22 by BrenBarn