UnicodeDecodeError: 'utf-8' codec can't decode byte error

I'm trying to fetch a response with urllib and decode it into readable text. The text is in Hebrew and also contains characters such as { and /.

The coding declaration at the top of my source file is:

# -*- coding: utf-8 -*-

The raw bytes are:

b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'

Now I'm trying to decode it using:

 data = data.decode()

and I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

854

asked Jul 08 '14 12:07

user1641071

1 Answer

Your problem is that the data is not UTF-8. It is UTF-16 encoded (the leading \xff\xfe bytes are a UTF-16 byte order mark), so decode it as such:

>>> data = b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'
>>> data.decode('utf16')
'{ \r\n"id" : "1404830064696",\r\n"title" : "פיקוד העורף התרעה במרחב ",\r\n"data" : []\r\n}\r\n\r\n'
>>> import json
>>> json.loads(data.decode('utf16'))
{'title': 'פיקוד העורף התרעה במרחב ', 'id': '1404830064696', 'data': []}
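When no charset information is available, the leading bytes can hint at the codec. The following is a minimal sketch, not part of the original answer: sniff_bom is a hypothetical helper name, and the approach only works when the sender actually wrote a byte order mark.

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM (ff fe 00 00)
# begins with the UTF-16 LE BOM (ff fe).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(raw, default="utf-8"):
    """Return a codec name based on a leading BOM, or `default` if none is found.

    Hypothetical helper for illustration only.
    """
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return codec
    return default


# A tiny stand-in for the bytes in the question: BOM, then "{}" in UTF-16 LE.
sample = b'\xff\xfe{\x00}\x00'
codec = sniff_bom(sample)
text = sample.decode(codec)  # the "utf-16" codec consumes the BOM itself
```

The returned names ("utf-16", "utf-32", "utf-8-sig") are the BOM-aware variants, so the decoded text does not start with a stray U+FEFF character.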

If you loaded this from a website with urllib.request, the Content-Type header should contain a charset parameter telling you which codec to use. If response is the object returned by urllib.request.urlopen(), use:

codec = response.info().get_content_charset('utf-8')

This defaults to UTF-8 when no charset parameter has been set, which is the appropriate default for JSON data.
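Putting the charset lookup together with decoding might look like the sketch below. decode_json_body is a hypothetical helper name, and the headers are simulated with email.message.Message (the base class of the object that response.info() returns) so the example runs without a network connection.

```python
import json
from email.message import Message


def decode_json_body(raw, headers, default="utf-8"):
    """Decode `raw` using the charset from Content-Type, then parse it as JSON.

    Hypothetical helper; `headers` is anything with get_content_charset(),
    such as the object returned by response.info().
    """
    charset = headers.get_content_charset(default)
    return json.loads(raw.decode(charset))


# Simulated response headers and body (assumed values for illustration).
headers = Message()
headers["Content-Type"] = "application/json; charset=utf-16"
raw = '{"id": "1404830064696", "data": []}'.encode("utf-16")

result = decode_json_body(raw, headers)

# With a real request it would be used roughly like this (url is a placeholder):
#   with urllib.request.urlopen(url) as response:
#       result = decode_json_body(response.read(), response.info())
```

If the server sends no charset parameter, or a wrong one, this still fails with a UnicodeDecodeError, so treat the header as a hint rather than a guarantee.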

Alternatively, use the requests library to load the JSON response; it handles decoding automatically (including UTF-codec autodetection specific to JSON responses).
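If you would rather stay with the standard library, note that since Python 3.6 json.loads also accepts bytes and detects UTF-8, UTF-16 and UTF-32 input on its own. This is a stdlib alternative to requests, not what the answer above describes; the sample payload is made up for illustration.

```python
import json

# Encode a small JSON document as UTF-16; Python prepends a byte order mark.
raw = '{"title": "פיקוד העורף", "data": []}'.encode("utf-16")

# No explicit .decode() call: json.loads inspects the leading bytes itself.
parsed = json.loads(raw)
title = parsed["title"]
```

This only works for JSON. For other text formats you still need to know or detect the encoding before decoding.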

One further note: the PEP 263 source code codec comment is used only to interpret your source code, including string literals. It has nothing to do with encodings of external sources (files, network data, etc.).
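A short sketch of that point: the file below declares utf-8 as its source encoding, yet bytes.decode() fails in exactly the same way, because its default codec is a fixed 'utf-8' argument and has nothing to do with the comment. The sample bytes are a shortened stand-in for the data in the question.

```python
# -*- coding: utf-8 -*-
# The comment above only tells Python how to read this source file.

# UTF-16 LE byte order mark followed by a Hebrew word encoded as UTF-16 LE.
raw = b'\xff\xfe' + 'התרעה'.encode('utf-16-le')

error = None
try:
    raw.decode()  # default codec is 'utf-8', regardless of the coding comment
except UnicodeDecodeError as exc:
    error = exc

# Naming the actual encoding of the data is what fixes it.
text = raw.decode('utf-16')
```

Here error holds the same "can't decode byte 0xff in position 0" failure as in the question, while text holds the decoded Hebrew word.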

195

answered Oct 03 '22 16:10

Martijn Pieters

Tags: python, encoding, utf-8, urllib
