I have a mis-encoded string: »Æ¹ûÊ÷
On the http://2cyr.com/decode/?lang=en website, you can encode it with gb2312 and then decode it with iso8859 so that it displays correctly.
In C#, there's a function called Encoding.Convert, which converts bytes from one encoding to another. The process is straightforward:
1. Encode the string into bytesA, using the gb2312 encoder
2. Encoding.Convert bytesA from gb2312 encoding to iso8859 encoding
3. Decode the bytes using the iso8859 encoder
In Python, I have tried every combination of encoding and decoding methods I can think of, but none of them converts the given string into a form that displays correctly.
Using the string encode() method, you can convert Unicode strings into any encoding supported by Python. If no encoding is specified, UTF-8 is used.
UTF-8 is a byte-oriented encoding: each character is represented by a specific sequence of one to four bytes.
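As a small illustration of the point about byte sequences, the following Python 3 sketch encodes three sample characters and records how many bytes each one takes in UTF-8. The sample characters and the variable names are chosen here purely for illustration.

```python
# Each character maps to a sequence of one or more bytes in UTF-8.
samples = ["A", "é", "黄"]

for ch in samples:
    encoded = ch.encode()  # no argument: Python 3 uses UTF-8
    print(repr(ch), encoded, len(encoded))

# The same character yields different bytes under a different codec.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
gb2312_bytes = "黄".encode("gb2312")  # b'\xbb\xc6'
```

Note that 黄 is three bytes in UTF-8 but two bytes (0xBB 0xC6) in GB2312, which is why the codec used to read the bytes back matters.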
Your data is GB2312 that was misread as Latin-1 and then encoded as UTF-8, at least as pasted into my UTF-8 configured terminal window:
>>> data = '»Æ¹ûÊ÷'
>>> data.decode('utf8').encode('latin1').decode('gb2312')
u'\u9ec4\u679c\u6811'
>>> print _
黄果树
Encoding to Latin-1 maps each character in the range U+0000 to U+00FF to the byte with the same value, which lets us reinterpret the characters as bytes and fix the encoding.
Rule of thumb: whenever you have double-encoded data, undo the extra 'layer' of encoding by decoding to Unicode using that codec, then encoding again with Latin-1 to get bytes again.
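For Python 3, where str is already Unicode and the initial decode('utf8') step is not needed, the same rule of thumb can be sketched as follows. The helper name fix_mojibake and its parameter names are hypothetical, chosen only for this example, and the sketch assumes that exactly one wrong layer (Latin-1) was applied on top of GB2312 bytes.

```python
def fix_mojibake(text, wrong="latin-1", right="gb2312"):
    """Undo one layer of mis-decoding (hypothetical helper).

    Assumes `text` was produced by decoding `right`-encoded bytes with the
    `wrong` codec; re-encoding with `wrong` recovers the original bytes.
    """
    raw = text.encode(wrong)   # U+0000..U+00FF -> bytes with the same values
    return raw.decode(right)   # read those bytes with the intended codec


garbled = "»Æ¹ûÊ÷"
fixed = fix_mojibake(garbled)
print(fixed)  # 黄果树

# Reproducing the damage from the clean text gives the garbled string back.
regarbled = fixed.encode("gb2312").decode("latin-1")
```

If the garbled text contains characters outside the Latin-1 range, text.encode("latin-1") raises UnicodeEncodeError, which usually means a different wrong codec (for example cp1252) was involved.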