Python how to decode unicode with hex characters

I extracted the following string from a web crawl script:

u'\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'

I want to decode u'\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91' as UTF-8. Using http://ddecode.com/hexdecoder/, I can see that the result should be '【中字】'.

I tried the following, but it failed.

msg = u'\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'
result = msg.decode('utf8')

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-11: ordinal not in range(128)

May I ask how to decode the string correctly?

Thanks for help.


asked Oct 13 '16 08:10

Shooting Chuang

2 Answers

The problem with

msg = u'\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'
result = msg.decode('utf8')

is that you are trying to decode Unicode. That doesn't really make sense. You can encode a Unicode string to a byte string in some encoding, or you can decode a byte string to Unicode.
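The two directions can be sketched as follows. This is a minimal illustration written for Python 3 (where `str` is the Unicode type and `bytes` is the byte string type); the variable names are only for illustration.

```python
# Python 3: str holds Unicode code points, bytes holds encoded data.
text = '【中字】'

# Encoding goes from Unicode to bytes.
data = text.encode('utf-8')
assert data == b'\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'

# Decoding goes from bytes back to Unicode.
assert data.decode('utf-8') == text
```

Note that the byte values here are exactly the hex escapes in the question's string, which suggests the crawler stored UTF-8 bytes as if they were characters.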

When you do

msg.decode('utf8')

Python 2 sees that msg is already Unicode. Since Unicode can't be decoded, it "helpfully" assumes that you want to encode msg with the default ASCII codec first, so that the resulting byte string can then be decoded to Unicode with the UTF-8 codec. That implicit ASCII encoding step is what raises the UnicodeEncodeError, because every character in msg is outside the ASCII range. Python 3 behaves much more sensibly: that code would simply fail with

AttributeError: 'str' object has no attribute 'decode'
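To make the hidden step visible, here is a small Python 3 sketch that performs the ASCII encode explicitly. It is only an approximation of what Python 2 does internally, not its actual implementation.

```python
msg = '\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'

# In Python 3, str has no decode method at all.
assert not hasattr(msg, 'decode')

# Python 2 effectively tried an ASCII encode like this before decoding.
error = None
try:
    msg.encode('ascii')
except UnicodeEncodeError as exc:
    error = exc  # keep a reference; 'exc' is cleared after the except block

# The failure covers all 12 characters, matching "position 0-11" in the traceback.
assert error is not None
assert (error.start, error.end) == (0, 12)
```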

The technique given in kennytm's answer:

msg.encode('latin1').decode('utf-8')

works because the Unicode code points below 256 correspond one-to-one to the byte values of the Latin1 encoding (aka ISO 8859-1). Encoding msg as Latin1 therefore recovers the original bytes, which can then be decoded as UTF-8.

Here's some Python 2 code that illustrates this:

for i in xrange(256):
    lat = chr(i)
    uni = unichr(i)
    assert lat == uni.encode('latin1')
    assert lat.decode('latin1') == uni

And here is the equivalent Python 3 code:

for i in range(256):
    lat = bytes([i])
    uni = chr(i)
    assert lat == uni.encode('latin1')
    assert lat.decode('latin1') == uni
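Applied to the string from the question, the same idea looks like this in Python 3. This is a sketch using the question's data; the names `raw` and `fixed` are just illustrative.

```python
# The mis-decoded string: 12 code points, each below 256.
msg = '\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'

# Latin1 maps each code point back to the byte with the same value.
raw = msg.encode('latin1')
assert raw == b'\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'

# Those bytes are valid UTF-8 for four characters.
fixed = raw.decode('utf-8')
assert fixed == '【中字】'
```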

You may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.

Unless you are forced to use Python 2, I strongly advise you to switch to Python 3. It will make handling Unicode far less painful.


answered Nov 10 '22 00:11

PM 2Ring

Perhaps you should fix the crawl script instead: a Unicode string should already contain u'【中字】' (u'\u3010\u4e2d\u5b57\u3011'), not the raw UTF-8 bytes.
To repair msg, first turn the wrong Unicode string back into a byte string (encode it as Latin-1), then decode that as UTF-8:
```
>>> print msg.encode('latin1').decode('utf-8')
【中字】
```
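If the crawl script cannot be fixed right away, the repair could be wrapped in a defensive helper. The following Python 3 sketch is an assumption-laden illustration: `repair_mojibake` is a hypothetical name, not part of any library, and it simply returns the input unchanged when the round trip is not possible.

```python
def repair_mojibake(text):
    """Hypothetical helper: undo UTF-8 bytes that were mis-decoded as Latin-1."""
    try:
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has code points above 255 (already proper Unicode),
        # or the recovered bytes are not valid UTF-8. Leave it untouched.
        return text


broken = '\xe3\x80\x90\xe4\xb8\xad\xe5\xad\x97\xe3\x80\x91'
assert repair_mojibake(broken) == '【中字】'

# Text that is already correct passes through unchanged.
assert repair_mojibake('【中字】') == '【中字】'
assert repair_mojibake('plain ascii') == 'plain ascii'
```

A caveat: genuine Latin-1 text whose bytes happen to form valid UTF-8 would also be rewritten, so this is best applied only to data known to come from the faulty source.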

answered Nov 09 '22 23:11

kennytm

Tags: python, python-2.x, utf-8
