Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

Tags:

python

json

I am using Python-2.6 CGI scripts but found this error in server log while doing json.dumps(),

Traceback (most recent call last):
  File "/etc/mongodb/server/cgi-bin/getstats.py", line 135, in <module>
    print json.dumps(​​__get​data())
  File "/usr/lib/python2.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

​Here ,

​__get​data() function returns dictionary {} .

Before posting this question I have referred this of question os SO.


UPDATES

Following line is hurting JSON encoder,

now = datetime.datetime.now()
now = datetime.datetime.strftime(now, '%Y-%m-%dT%H:%M:%S.%fZ')
print json.dumps({'current_time': now}) # this is the culprit

I got a temporary fix for it

print json.dumps( {'old_time': now.encode('ISO-8859-1').strip() })

But I am not sure is it correct way to do it.

like image 587
Deepak Ingole Avatar asked Mar 06 '14 05:03

Deepak Ingole


People also ask

What utf 8 codec can t decode byte?

The Python "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte" occurs when we specify an incorrect encoding when decoding a bytes object. To solve the error, specify the correct encoding, e.g. utf-16 or open the file in binary mode ( rb or wb ).

How do I decode a UTF 8 string in Python?

To decode a string encoded in UTF-8 format, we can use the decode() method specified on strings. This method accepts two arguments, encoding and error . encoding accepts the encoding of the string to be decoded, and error decides how to handle errors that arise during decoding.


17 Answers

I switched this simply by defining a different codec package in the read_csv() command:

encoding = 'unicode_escape'

Eg:

import pandas as pd
data = pd.read_csv(filename, encoding= 'unicode_escape')
like image 128
MSalty Avatar answered Sep 28 '22 06:09

MSalty


Try the below code snippet:

with open(path, 'rb') as f:
  text = f.read()
like image 40
Soumyaansh Avatar answered Sep 28 '22 07:09

Soumyaansh


The error is because there is some non-ascii character in the dictionary and it can't be encoded/decoded. One simple way to avoid this error is to encode such strings with encode() function as follows (if a is the string with non-ascii character):

a.encode('utf-8').strip()
like image 29
Santosh Ghimire Avatar answered Sep 28 '22 07:09

Santosh Ghimire


Your string has a non ascii character encoded in it.

Not being able to decode with utf-8 may happen if you've needed to use other encodings in your code. For example:

>>> 'my weird character \x96'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 19: invalid start byte

In this case, the encoding is windows-1252 so you have to do:

>>> 'my weird character \x96'.decode('windows-1252')
u'my weird character \u2013'

Now that you have Unicode, you can safely encode into utf-8.

like image 34
JCF Avatar answered Sep 28 '22 08:09

JCF


On read csv, I added an encoding method:

import pandas as pd
dataset = pd.read_csv('sample_data.csv', header= 0,
                        encoding= 'unicode_escape')
like image 30
Krishna prasad.m Avatar answered Sep 28 '22 06:09

Krishna prasad.m


This solution worked for me:

import pandas as pd
data = pd.read_csv("training.csv", encoding = 'unicode_escape')
like image 32
shiva Avatar answered Sep 28 '22 07:09

shiva


Inspired by @aaronpenne and @Soumyaansh

f = open("file.txt", "rb")
text = f.read().decode(errors='replace')
like image 43
Punnerud Avatar answered Sep 28 '22 07:09

Punnerud


As of 2018-05 this is handled directly with decode, at least for Python 3.

I'm using the below snippet for invalid start byte and invalid continuation byte type errors. Adding errors='ignore' fixed it for me.

with open(out_file, 'rb') as f:
    for line in f:
        print(line.decode(errors='ignore'))
like image 30
aaronpenne Avatar answered Sep 28 '22 07:09

aaronpenne


Simple Solution:

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')
like image 32
Gil Baggio Avatar answered Sep 28 '22 07:09

Gil Baggio


Set default encoder at the top of your code

import sys
reload(sys)
sys.setdefaultencoding("ISO-8859-1")
like image 31
HimalayanCoder Avatar answered Sep 28 '22 08:09

HimalayanCoder


If the above methods are not working for you, you may want to look into changing the encoding of the csv file itself.

Using Excel:

  1. Open csv file using Excel
  2. Navigate to File menu option and click Save As
  3. Click Browse to select a location to save the file
  4. Enter intended filename
  5. Select CSV (Comma delimited) (*.csv) option
  6. Click Tools drop-down box and click Web Options
  7. Under Encoding tab, select the option Unicode (UTF-8) from Save this document as drop-down list
  8. Save the file

Using Notepad:

  1. Open csv file using notepad
  2. Navigate to File > Save As option
  3. Next, select the location to the file
  4. Select the Save as type option as All Files(.)
  5. Specify the file name with .csv extension
  6. From Encoding drop-down list, select UTF-8 option.
  7. Click Save to save the file

By doing this, you should be able to import csv files without encountering the UnicodeCodeError.

like image 24
Zuo Avatar answered Sep 28 '22 07:09

Zuo


Following line is hurting JSON encoder,

now = datetime.datetime.now()
now = datetime.datetime.strftime(now, '%Y-%m-%dT%H:%M:%S.%fZ')
print json.dumps({'current_time': now}) // this is the culprit

I got a temporary fix for it

print json.dumps( {'old_time': now.encode('ISO-8859-1').strip() })

Marking this as correct as a temporary fix (Not sure so).

like image 30
Deepak Ingole Avatar answered Sep 28 '22 06:09

Deepak Ingole


The following snippet worked for me.

import pandas as pd
df = pd.read_csv(filename, sep = ';', encoding = 'latin1', error_bad_lines=False) #error_bad_lines is avoid single line error
like image 27
amit haldar Avatar answered Sep 28 '22 07:09

amit haldar


You may use any standard encoding of your specific usage and input.

utf-8 is the default.

iso8859-1 is also popular for Western Europe.

e.g: bytes_obj.decode('iso8859-1')

see: docs

like image 23
NoamG Avatar answered Sep 28 '22 06:09

NoamG


After trying all the aforementioned workarounds, if it still throws the same error, you can try exporting the file as CSV (a second time if you already have). Especially if you're using scikit learn, it is best to import the dataset as a CSV file.

I spent hours together, whereas the solution was this simple. Export the file as a CSV to the directory where Anaconda or your classifier tools are installed and try.

like image 39
Sushmita Avatar answered Sep 28 '22 06:09

Sushmita


Instead of looking for ways to decode a5 (Yen ¥) or 96 (en-dash ), tell MySQL that your client is encoded "latin1", but you want "utf8" in the database.

See details in Trouble with UTF-8 characters; what I see is not what I stored

like image 31
Rick James Avatar answered Sep 28 '22 07:09

Rick James


I encountered the same error while trying to import to a pandas dataframe from an excel sheet on sharepoint. My solution was using engine='openpyxl'. I'm also using requests_negotiate_sspi to avoid storing passwords in plain text.

import requests
from io import BytesIO
from requests_negotiate_sspi import HttpNegotiateAuth
cert = r'c:\path_to\saved_certificate.cer'
target_file_url = r'https://share.companydomain.com/sites/Sitename/folder/excel_file.xlsx'
response = requests.get(target_file_url, auth=HttpNegotiateAuth(), verify=cert)
df = pd.read_excel(BytesIO(response.content), engine='openpyxl', sheet_name='Sheet1')
like image 31
Paul Avatar answered Sep 28 '22 08:09

Paul