<p>I have this string that has been decoded from Quoted-printable to ISO-8859-1 with the email module. This gives me strings like "\xC4pple" which would correspond to "Äpple" (Apple in Swedish). However, I can't convert those strings to UTF-8.</p> <pre class="prettyprint"><code>>>> apple = "\xC4pple"
>>> apple
'\xc4pple'
>>> apple.encode("UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
</code></pre> <p>What should I do?</p>

<p>This is a common problem, so here's a relatively thorough illustration.</p> <p>For non-unicode strings (i.e. those without the <code>u</code> prefix, unlike <code>u'\xc4pple'</code>), one must first decode from the encoding the bytes are actually in (here <code>iso8859-1</code>/<code>latin1</code>) to <code>unicode</code>, then encode to a character set that can represent the characters you wish; in this case I'd recommend <code>UTF-8</code>. Calling <code>.encode()</code> directly on a <code>str</code> makes Python 2 decode it implicitly with the default encoding, which is <code>ascii</code> unless modified with the enigmatic <code>sys.setdefaultencoding</code> function, and that implicit step is what raises the error in the question.</p> <p>First, here is a handy utility function that'll help illuminate the patterns of Python 2.7 string and unicode:</p> <pre class="prettyprint"><code>>>> def tell_me_about(s): return (type(s), s)
</code></pre> <h3>A plain string</h3> <pre class="prettyprint"><code>>>> v = "\xC4pple" # iso-8859-1 aka latin1 encoded string

>>> tell_me_about(v)
(<type 'str'>, '\xc4pple')

>>> v
'\xc4pple'        # representation in memory

>>> print v
?pple             # the raw iso-8859-1 byte '\xc4' is written to the terminal as-is;
                  # on its own it is not a valid utf-8 sequence,
                  # so a utf-8 terminal shows it as "?".
</code></pre> <h3>Decoding an iso8859-1 string - convert plain string to unicode</h3> <pre class="prettyprint"><code>>>> uv = v.decode("iso-8859-1")
>>> uv
u'\xc4pple'       # the decoded iso-8859-1 bytes are now unicode, in memory

>>> tell_me_about(uv)
(<type 'unicode'>, u'\xc4pple')

>>> print v.decode("iso-8859-1")
Äpple             # print encodes the unicode with sys.stdout.encoding
                  # (utf-8 in this terminal)

>>> v.decode('iso-8859-1') == u'\xc4pple'
True              # one could have just used a unicode representation
                  # from the start
</code></pre> <h3>A little more illustration - with “Ä”</h3> <pre class="prettyprint"><code>>>> u"Ä" == u"\xc4"
True              # the native unicode char and escaped versions are the same

>>> "Ä" == u"\xc4"
False             # as a byte string typed in a utf-8 terminal, "Ä" is
                  # '\xc3\x84' (utf-8), which is not the unicode u'\xc4'

>>> "Ä".decode('utf8') == u"\xc4"
True              # one can decode the string to get unicode

>>> "Ä" == "\xc4"
False             # the native character and the escaped string are
                  # of course not equal ('\xc3\x84' != '\xc4').
</code></pre> <h3>Encoding to UTF</h3> <pre class="prettyprint"><code>>>> u8 = v.decode("iso-8859-1").encode("utf-8")
>>> u8
'\xc3\x84pple'    # convert iso-8859-1 to unicode to utf-8

>>> tell_me_about(u8)
(<type 'str'>, '\xc3\x84pple')

>>> u16 = v.decode('iso-8859-1').encode('utf-16')
>>> tell_me_about(u16)
(<type 'str'>, '\xff\xfe\xc4\x00p\x00p\x00l\x00e\x00')

>>> tell_me_about(u8.decode('utf8'))
(<type 'unicode'>, u'\xc4pple')

>>> tell_me_about(u16.decode('utf16'))
(<type 'unicode'>, u'\xc4pple')
</code></pre> <h3>Relationship between unicode and UTF and latin1</h3> <pre class="prettyprint"><code>>>> print u8
Äpple             # the raw utf-8 bytes are valid for a utf-8 terminal,
                  # so the characters display correctly

>>> print u8.decode('utf-8') # printing unicode
Äpple

>>> print u16     # printing 'bytes' of u16
��pple

>>> print u16.decode('utf16')
Äpple             # printing unicode

>>> v == u8
False             # v is an iso8859-1 string; u8 is a utf-8 string

>>> v.decode('iso8859-1') == u8
False             # v.decode(...) returns unicode

>>> u8.decode('utf-8') == v.decode('latin1') == u16.decode('utf-16')
True              # all decode to the same unicode memory representation
                  # (latin1 is iso-8859-1)
</code></pre> <h3>Unicode Exceptions</h3> <pre class="prettyprint"><code>>>> u8.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

>>> u16.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

>>> v.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
</code></pre> <p>Note that these are <em>decode</em> errors raised by <code>encode</code> calls: to encode a <code>str</code>, Python 2 first decodes it implicitly with the <code>ascii</code> codec, and that step fails on any byte above 0x7f. One gets around this by explicitly decoding from the specific encoding (latin-1, utf8, utf16) to unicode first, e.g. <code>u8.decode('utf8').encode('latin1')</code>.</p> <p>So perhaps one could draw the following principles and generalizations:</p> <ul> <li>a <code>str</code> is a sequence of bytes, which may be in any one of a number of encodings such as Latin-1, UTF-8, and UTF-16</li> <li>a <code>unicode</code> is a sequence of code points that can be encoded to any number of encodings, most commonly UTF-8 and latin-1 (iso8859-1)</li> <li>the <code>print</code> statement encodes unicode using <code>sys.stdout.encoding</code>, which follows the terminal's settings (UTF-8 in the examples above)</li> <li>One must decode a <code>str</code> to unicode before converting to another encoding.</li> </ul> <p>Of course, all of this changes in Python 3.x.</p> <p>Hope that is illuminating.</p> <h3>Further reading</h3> <ul> <li> Characters vs. 
Bytes, by Tim Bray.</li> </ul> <p>And the very illustrative rants by Armin Ronacher:</p> <ul> <li>The Updated Guide to Unicode on Python (July 2, 2013)</li> <li>More About Unicode in Python 2 and 3 (January 5, 2014)</li> <li>UCS vs UTF-8 as Internal String Encoding (January 9, 2014)</li> <li>Everything you did not want to know about Unicode in Python 3 (May 12, 2014)</li> </ul>
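<p>As a rough, version-neutral sketch of the principles above (an illustration, not part of the original answer): the snippet uses <code>b"..."</code> literals so that it should run on Python 2.6+ as well as Python 3. It makes the implicit ASCII decode behind the tracebacks explicit, then performs the decode-first round trip. The names <code>implicit_step_failed</code> and <code>round_trip</code> are illustrative only.</p>

```python
# ISO-8859-1 (latin1) bytes for "Äpple", as in the question.
v = b"\xC4pple"

# Decode from the encoding the bytes are really in, then encode to UTF-8.
u8 = v.decode("iso-8859-1").encode("utf-8")

# Python 2 performs this ASCII decode implicitly when .encode() is called
# on a str; doing it explicitly reproduces the same failure.
try:
    u8.decode("ascii")
    implicit_step_failed = False
except UnicodeDecodeError:
    implicit_step_failed = True

# Going back the other way also needs an explicit decode first.
round_trip = u8.decode("utf-8").encode("iso-8859-1")
```

<p>Both directions go through the decoded text; no step encodes bytes directly.</p>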

<p>Try decoding it first, then encoding:</p> <pre class="prettyprint"><code>apple.decode('iso-8859-1').encode('utf8') </code></pre>
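<p>If the conversion is needed in several places, it could be wrapped in a small helper. This is only a sketch under assumptions: <code>latin1_to_utf8</code> is a hypothetical name, and the input is assumed to really be ISO-8859-1 bytes (every byte value is valid in that encoding, so the decode step cannot detect a wrong guess).</p>

```python
def latin1_to_utf8(raw):
    """Re-encode ISO-8859-1 bytes as UTF-8 bytes (hypothetical helper)."""
    # Step 1: bytes -> text. Every byte value maps to a character in
    # ISO-8859-1, so this step never raises for byte input.
    text = raw.decode("iso-8859-1")
    # Step 2: text -> UTF-8 bytes.
    return text.encode("utf-8")


apple = b"\xC4pple"
utf8_apple = latin1_to_utf8(apple)
```

<p>Pure-ASCII input comes back unchanged, since ASCII bytes are encoded identically in ISO-8859-1 and UTF-8.</p>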

Python: Converting from ISO-8859-1/latin1 to UTF-8

Tags:

python

character-encoding

asked Jun 30 '11 19:06

Zyberzero

2 Answers

Brian M. Hunt

Mat
