
UnicodeDecodeError: 'utf-8' codec can't decode byte error

I'm trying to fetch a response with urllib and decode it into readable text. The text is in Hebrew and also contains characters like { and /.

The encoding declaration at the top of my source file is:

# -*- coding: utf-8 -*-

The raw response bytes are:

b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'

Now I'm trying to decode it using:

 data = data.decode()

and I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
user1641071 asked Jul 08 '14 12:07




1 Answer

Your problem is that the data is not UTF-8. It is UTF-16 encoded (the leading b'\xff\xfe' is the UTF-16 little-endian byte order mark), so decode it as such:

>>> data = b'\xff\xfe{\x00 \x00\r\x00\n\x00"\x00i\x00d\x00"\x00 \x00:\x00 \x00"\x001\x004\x000\x004\x008\x003\x000\x000\x006\x004\x006\x009\x006\x00"\x00,\x00\r\x00\n\x00"\x00t\x00i\x00t\x00l\x00e\x00"\x00 \x00:\x00 \x00"\x00\xe4\x05\xd9\x05\xe7\x05\xd5\x05\xd3\x05 \x00\xd4\x05\xe2\x05\xd5\x05\xe8\x05\xe3\x05 \x00\xd4\x05\xea\x05\xe8\x05\xe2\x05\xd4\x05 \x00\xd1\x05\xde\x05\xe8\x05\xd7\x05\xd1\x05 \x00"\x00,\x00\r\x00\n\x00"\x00d\x00a\x00t\x00a\x00"\x00 \x00:\x00 \x00[\x00]\x00\r\x00\n\x00}\x00\r\x00\n\x00\r\x00\n\x00'
>>> data.decode('utf16')
'{ \r\n"id" : "1404830064696",\r\n"title" : "פיקוד העורף התרעה במרחב ",\r\n"data" : []\r\n}\r\n\r\n'
>>> import json
>>> json.loads(data.decode('utf16'))
{'title': 'פיקוד העורף התרעה במרחב ', 'id': '1404830064696', 'data': []}
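
If you have to recognise the encoding from the bytes alone, the byte order mark at the start of the data is the usual hint. The following is only a minimal sketch of that idea: sniff_bom and its fallback default are hypothetical names introduced here, not part of any library, and data without a BOM simply falls back to the default.

```python
import codecs

# Checked in order. UTF-32 comes first because its little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(raw, default='utf-8'):
    """Return a codec name guessed from a leading BOM (hypothetical helper)."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return codec
    # No BOM found: this is an assumption, not a detection.
    return default


# Shortened stand-in for the bytes in the question: BOM, then '{' and '}'.
sample = b'\xff\xfe{\x00}\x00'
codec = sniff_bom(sample)
text = sample.decode(codec)  # the 'utf-16' codec consumes the BOM
```

The codec names returned here ('utf-16', 'utf-32', 'utf-8-sig') are the BOM-aware variants, so the mark itself does not end up in the decoded string.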

If you loaded this from a website with urllib.request, the Content-Type header should contain a charset parameter telling you this; if response is the returned urllib.request response object, then use:

codec = response.info().get_content_charset('utf-8')

This defaults to UTF-8 when no charset parameter has been set, which is the appropriate default for JSON data.
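
As a rough sketch of how that fits together: the object returned by response.info() behaves like an email.message.Message, so the example below builds one by hand instead of making a network request. The decode_body helper, the header value, and the payload are assumptions made up for illustration.

```python
import json
from email.message import Message


def decode_body(raw, headers):
    """Decode raw bytes using the charset from the Content-Type header.

    Hypothetical helper; `headers` stands in for response.info().
    """
    codec = headers.get_content_charset('utf-8')  # falls back to UTF-8
    return raw.decode(codec)


# Stand-in for the headers of a real response (assumed values).
headers = Message()
headers['Content-Type'] = 'application/json; charset=utf-16'

# Stand-in for response.read(): UTF-16 bytes with a leading BOM.
raw = '{"id": "1404830064696", "data": []}'.encode('utf-16')

text = decode_body(raw, headers)
parsed = json.loads(text)
```

With a real request, raw would be response.read() and headers would be response.info(). If the server sends the wrong charset or none at all for non-UTF-8 data, the decode will still fail, so this only helps when the header is accurate.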

Alternatively, use the requests library to load the JSON response; it handles decoding automatically (including UTF-codec autodetection specific to JSON responses).
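
For comparison, a similar autodetection is available without a third-party dependency: since Python 3.6, the standard library's json.loads also accepts bytes and works out whether they are UTF-8, UTF-16, or UTF-32. The sketch below uses a shortened, made-up payload in place of the real response body.

```python
import json

# Stand-in for the response body: UTF-16 bytes with a leading byte order
# mark, like the data in the question (payload shortened for illustration).
raw = '{"id": "1404830064696", "data": []}'.encode('utf-16')

# json.loads (Python 3.6+) accepts bytes and detects the UTF encoding
# itself, so no explicit .decode() call is needed here.
parsed = json.loads(raw)
```

This only applies to JSON payloads; for other content you still need to know or detect the encoding yourself.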

One further note: the PEP 263 source code codec comment is used only to interpret your source code, including string literals. It has nothing to do with encodings of external sources (files, network data, etc.).

Martijn Pieters answered Oct 03 '22 16:10
