How to convert encoding in Python?

I have a mis-encoded string, »Æ¹ûÊ÷. On the http://2cyr.com/decode/?lang=en website, you can encode it with gb2312 and then decode it with iso8859 to display it correctly.

In C#, there's a function called Encoding.Convert, which converts bytes from one encoding to another. The process is straightforward:

1. Encode the string into bytesA, using the gb2312 encoder
2. Use Encoding.Convert to convert bytesA from gb2312 encoding to iso8859 encoding
3. Decode the resulting bytes using the iso8859 decoder
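Python has no single function named like Encoding.Convert; a rough counterpart is to decode the bytes with the source codec and encode the result with the target codec. The sketch below assumes Python 3, and `convert_encoding` is a hypothetical helper name, not a standard library function.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one codec to another (hypothetical helper)."""
    # Decode to str (Unicode text) first, then encode with the target codec.
    return data.decode(source).encode(target)


# Example: the same text as GB2312 bytes, converted to UTF-8 bytes.
gb_bytes = "\u9ec4\u679c\u6811".encode("gb2312")
utf8_bytes = convert_encoding(gb_bytes, "gb2312", "utf-8")
print(gb_bytes)    # b'\xbb\xc6\xb9\xfb\xca\xf7'
print(utf8_bytes)  # UTF-8 bytes for the same three characters
```

The conversion raises UnicodeDecodeError or UnicodeEncodeError if the bytes do not fit the source codec or the text cannot be represented in the target codec.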

In Python, I have tried every combination of encoding and decoding methods I can think of, but none of them converts the given string into text that displays correctly.

David S. asked Jan 04 '14 14:01



1 Answer

Your data is GB2312 that was mis-decoded as Latin-1 and then encoded as UTF-8, at least as pasted into my UTF-8 configured terminal window:

>>> data = '»Æ¹ûÊ÷'
>>> data.decode('utf8').encode('latin1').decode('gb2312')
u'\u9ec4\u679c\u6811'
>>> print _
黄果树

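Latin-1 maps the first 256 Unicode code points one-to-one onto byte values, so encoding to Latin-1 turns the characters back into the original bytes, which can then be decoded with the correct codec.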
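The session above is Python 2, where the pasted literal is a byte string. As a sketch of the same fix under Python 3 (an assumption, since the answer does not show it), the literal is already Unicode text, so the initial decode from UTF-8 is not needed:

```python
# Assumes Python 3, where a str literal is already Unicode text.
garbled = "»Æ¹ûÊ÷"

# Encoding to Latin-1 recovers the raw bytes that were mis-decoded,
# because each of these characters has a code point below U+0100.
raw = garbled.encode("latin-1")
print(raw)  # b'\xbb\xc6\xb9\xfb\xca\xf7'

# Those bytes are really GB2312, so decode them with the right codec.
fixed = raw.decode("gb2312")
print(fixed == "\u9ec4\u679c\u6811")  # True
```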

Rule of thumb: whenever you have double-encoded data, undo the extra 'layer' of encoding by decoding to Unicode using that codec, then encoding again with Latin-1 to get bytes again.
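The rule of thumb can be wrapped in a small function. This is a minimal sketch assuming Python 3; `repair_mojibake` is a hypothetical name, and its default codecs are taken from this question rather than being a general recommendation. It leaves the text unchanged when the repair does not apply.

```python
def repair_mojibake(text: str, wrong: str = "latin-1", right: str = "gb2312") -> str:
    """Undo one layer of mis-decoding (hypothetical helper).

    `wrong` is the codec the bytes were mistakenly decoded with;
    `right` is the codec the bytes were actually written in.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the assumed codecs, so return it untouched.
        return text


# Reproduce the garbling from the question, then undo it.
garbled = "\u9ec4\u679c\u6811".encode("gb2312").decode("latin-1")
repaired = repair_mojibake(garbled)
print(repaired == "\u9ec4\u679c\u6811")  # True
```

Returning the input on failure keeps the helper safe to call on text that was never garbled, at the cost of hiding cases where the assumed codecs are simply wrong.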

Martijn Pieters answered Sep 20 '22 01:09