Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c

I have a socket server that is supposed to receive UTF-8 valid characters from clients.

The problem is some clients (mainly hackers) are sending all the wrong kind of data over it.

I can easily distinguish the genuine client, but I am logging to files all the data sent so I can analyze it later.

Sometimes I get characters like this œ that cause the UnicodeDecodeError error.

I need to be able to make the string UTF-8 with or without those characters.


Update:

For my particular case the socket service was an MTA and thus I only expect to receive ASCII commands such as:

EHLO example.com MAIL FROM: <[email protected]> ... 

I was logging all of this in JSON.

Then some folks out there without good intentions decided to send all kind of junk.

That is why for my specific case it is perfectly OK to strip the non ASCII characters.

like image 960
transilvlad Avatar asked Sep 17 '12 22:09

transilvlad


People also ask

What is UTF-8 codec can't decode byte?

The Python "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte" occurs when we specify an incorrect encoding when decoding a bytes object. To solve the error, specify the correct encoding, e.g. utf-16 or open the file in binary mode ( rb or wb ).

How do I decode a UTF-8 string in Python?

To decode a string encoded in UTF-8 format, we can use the decode() method specified on strings. This method accepts two arguments, encoding and error . encoding accepts the encoding of the string to be decoded, and error decides how to handle errors that arise during decoding.

What is byte 0xb0?

This is indicated by the most significant bit of the byte. 0xb0 translates to 1011 0000 in binary and as you can see, the first bit is a 1 and that tells the utf-8 decoder that it needs more bytes for the character to be read.


2 Answers

http://docs.python.org/howto/unicode.html#the-unicode-type

str = unicode(str, errors='replace') 

or

str = unicode(str, errors='ignore') 

Note: This will strip out (ignore) the characters in question returning the string without them.

For me this is ideal case since I'm using it as protection against non-ASCII input which is not allowed by my application.

Alternatively: Use the open method from the codecs module to read in the file:

import codecs with codecs.open(file_name, 'r', encoding='utf-8',                  errors='ignore') as fdata: 
like image 51
transilvlad Avatar answered Oct 05 '22 23:10

transilvlad


Changing the engine from C to Python did the trick for me.

Engine is C:

pd.read_csv(gdp_path, sep='\t', engine='c') 

'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

Engine is Python:

pd.read_csv(gdp_path, sep='\t', engine='python') 

No errors for me.

like image 36
Doğuş Avatar answered Oct 05 '22 23:10

Doğuş