UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c

Tags:

python

linux

python-unicode

I have a socket server that is supposed to receive UTF-8 valid characters from clients.

The problem is some clients (mainly hackers) are sending all the wrong kind of data over it.

I can easily distinguish the genuine client, but I am logging to files all the data sent so I can analyze it later.

Sometimes I get characters like œ that cause a UnicodeDecodeError.

I need to be able to make the string UTF-8 with or without those characters.

Update:

For my particular case the socket service was an MTA and thus I only expect to receive ASCII commands such as:

EHLO example.com
MAIL FROM: <john.doe@example.com>
...

I was logging all of this in JSON.

Then some folks out there without good intentions decided to send all kinds of junk.

That is why, for my specific case, it is perfectly OK to strip the non-ASCII characters.
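For the logging scenario described above, one possible sketch (assuming Python 3, where socket data arrives as `bytes`; the helper name `to_log_record` and the sample payload are hypothetical) keeps only the ASCII characters for the log entry and also records an escaped copy of the raw bytes, so the junk can still be analyzed later:

```python
import json


def to_log_record(raw: bytes) -> str:
    """Build a JSON log line from raw socket data (hypothetical helper)."""
    record = {
        # Drop every byte that is not valid ASCII.
        "command": raw.decode("ascii", errors="ignore"),
        # Keep undecodable bytes visible as \xNN escapes for later analysis.
        "raw": raw.decode("utf-8", errors="backslashreplace"),
    }
    return json.dumps(record)


line = to_log_record(b"EHLO example.com\x9c")
print(line)
```

Decoding with `errors="backslashreplace"` requires Python 3.5 or later.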


asked Sep 17 '12 22:09

transilvlad

2 Answers

http://docs.python.org/howto/unicode.html#the-unicode-type

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')

Note: This will strip out (ignore) the characters in question returning the string without them.

For me this is the ideal case, since I'm using it as protection against non-ASCII input, which is not allowed by my application.
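The `unicode()` built-in above exists only in Python 2. A rough Python 3 equivalent, assuming the data arrives as `bytes` (the sample payload below is made up for illustration), passes the same `errors` values to `bytes.decode`:

```python
# Sample payload (an assumption for illustration): an ASCII command
# followed by the stray byte 0x9c, which is not valid UTF-8 on its own.
data = b"EHLO example.com\x9c"

try:
    data.decode("utf-8")  # the default errors="strict" raises here
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = data.decode("utf-8", errors="replace")

# errors="ignore" silently drops undecodable bytes.
ignored = data.decode("utf-8", errors="ignore")

print(repr(replaced))
print(repr(ignored))
```

Python 2's `unicode()` without an explicit encoding decodes with the default encoding (normally ASCII), so substituting `"ascii"` for `"utf-8"` here is the closer match when every non-ASCII byte should be stripped.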

Alternatively: Use the open method from the codecs module to read in the file:

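import codecs

# errors='ignore' drops bytes that are not valid UTF-8 instead of raising.
with codecs.open(file_name, 'r', encoding='utf-8', errors='ignore') as fdata:
    text = fdata.read()  # illustrative body: read the decoded contents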
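In Python 3, the built-in `open` accepts the same `encoding` and `errors` arguments, so `codecs.open` is not strictly needed. A minimal sketch, using a temporary file with made-up contents as a stand-in for the real file:

```python
import os
import tempfile

# Write a stand-in file containing one byte (0x9c) that is invalid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as tmp:
    tmp.write(b"MAIL FROM: <john.doe@example.com>\x9c\n")
    file_name = tmp.name

# errors="ignore" drops undecodable bytes instead of raising.
with open(file_name, "r", encoding="utf-8", errors="ignore") as fdata:
    text = fdata.read()

os.remove(file_name)  # clean up the stand-in file
print(repr(text))
```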

answered Oct 05 '22 23:10

transilvlad

Changing the engine from C to Python did the trick for me.

Engine is C:

pd.read_csv(gdp_path, sep='\t', engine='c')

'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte

Engine is Python:

pd.read_csv(gdp_path, sep='\t', engine='python')

No errors for me.
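Byte 0x92 in the error above is the right single quotation mark (’) in Windows-1252, which suggests, though does not prove, that the file was saved in that encoding rather than UTF-8. If so, naming the encoding explicitly may be an alternative to switching engines; `pd.read_csv` accepts an `encoding` argument for this. The idea can be sketched with the standard library alone, using made-up tab-separated data:

```python
import csv
import io

# Made-up tab-separated bytes; 0x92 is a curly apostrophe in Windows-1252.
raw = b"country\tgdp\nCote d\x92Ivoire\t70\n"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)

# Decoding with the encoding the data is assumed to be written in succeeds.
text = raw.decode("cp1252")
rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
print(rows)
```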

answered Oct 05 '22 23:10

Doğuş
