
Python 3 CSV file giving UnicodeDecodeError: 'utf-8' codec can't decode byte error when I print

I have the following code in Python 3, which is meant to print out each line in a csv file.

import csv

with open('my_file.csv', 'r', newline='') as csvfile:
    lines = csv.reader(csvfile, delimiter=',', quotechar='|')
    for line in lines:
        print(' '.join(line))

But when I run it, it gives me this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte 

I looked through the csv file, and it turns out that if I take out a single ñ (little n with a tilde on top), every line prints out fine.

My problem is that I've looked through a bunch of different solutions to similar problems, but I still have no idea how to fix this, what to decode/encode, etc. Simply taking out the ñ character in the data is NOT an option.

asked Feb 01 '14 22:02 by HLH



2 Answers

We know the file contains the byte b'\x96' since it is mentioned in the error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte 
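To see where such bytes sit in a file, one option is to read it in binary mode and list every byte outside the ASCII range. The following is only a minimal sketch: the helper name `find_non_ascii` and the sample data are made up for illustration and are not part of the original question.

```python
def find_non_ascii(data: bytes):
    """Return (offset, byte value) pairs for every byte >= 0x80."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]


# Hypothetical sample standing in for the raw contents of my_file.csv.
sample = b'name,city\nPe\x96a,Madrid\n'

for offset, value in find_non_ascii(sample):
    print('offset {}: 0x{:02x}'.format(offset, value))
```

For a real file, the bytes could be obtained with `open('my_file.csv', 'rb').read()` and passed to the same helper.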

Now we can write a little script to find out if there are any encodings where b'\x96' decodes to ñ:

import pkgutil
import encodings
import os

def all_encodings():
    # Collect every codec module shipped in the encodings package...
    modnames = set([modname for importer, modname, ispkg in pkgutil.walk_packages(
        path=[os.path.dirname(encodings.__file__)], prefix='')])
    # ...plus the canonical names that the registered aliases point to.
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

text = b'\x96'
for enc in all_encodings():
    try:
        msg = text.decode(enc)
    except Exception:
        # Skip codecs that cannot decode this byte (or are not text codecs).
        continue
    if msg == 'ñ':
        print('Decoding {t} with {enc} is {m}'.format(t=text, enc=enc, m=msg))

which yields

Decoding b'\x96' with mac_roman is ñ
Decoding b'\x96' with mac_farsi is ñ
Decoding b'\x96' with mac_croatian is ñ
Decoding b'\x96' with mac_arabic is ñ
Decoding b'\x96' with mac_romanian is ñ
Decoding b'\x96' with mac_iceland is ñ
Decoding b'\x96' with mac_turkish is ñ

Therefore, try changing

with open('my_file.csv', 'r', newline='') as csvfile: 

so that it specifies one of those encodings, for example:

with open('my_file.csv', 'r', encoding='mac_roman', newline='') as csvfile: 
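As a rough end-to-end check of this approach, the sketch below writes a small CSV file encoded as mac_roman and reads it back with the same `encoding` argument. The file name `sample.csv` and the row contents are hypothetical; the asker's real data is not shown in the question.

```python
import csv
import os
import tempfile

# Hypothetical sample row; mac_roman stores ñ as the single byte 0x96.
rows = [['Peña', 'Madrid']]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample.csv')

    # Write the file in mac_roman to mimic the problematic input.
    with open(path, 'w', encoding='mac_roman', newline='') as f:
        csv.writer(f).writerows(rows)

    # Look at the raw bytes: they contain 0x96, which is not valid UTF-8 here.
    with open(path, 'rb') as f:
        raw = f.read()

    # Reading with the matching encoding recovers the original text.
    with open(path, 'r', encoding='mac_roman', newline='') as f:
        decoded = list(csv.reader(f))

print(raw)
print(decoded)
```

If the file really was produced with a Mac OS Roman encoding, the decoded rows should contain ñ intact; if other characters come out wrong, a different encoding from the list may be the right one.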
answered Sep 22 '22 02:09 by unutbu


with open('my_file.csv', 'r', newline='', encoding='ISO-8859-1') as csvfile:

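The file is not encoded as UTF-8, so the byte that stores ñ cannot be decoded with the utf-8 codec (UTF-8 itself can represent ñ). Opening the file with ISO-8859-1 avoids the error because that codec accepts every byte value. Note, however, that ISO-8859-1 maps 0x96 to a control character rather than to ñ, so check that the decoded text looks right. For more details about this encoding, see: https://www.ic.unicamp.br/~stolfi/EXPORT/www/ISO-8859-1-Encoding.html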
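A quick way to compare candidate encodings is to decode the byte from the error message with each of them and look at the result. This is only an illustrative sketch; the three codecs chosen here are an assumption about likely candidates, and cp1252 in particular is not mentioned in the question.

```python
raw = b'\x96'

# Decode the same byte with several single-byte codecs and show the code point.
for enc in ('mac_roman', 'cp1252', 'iso-8859-1'):
    text = raw.decode(enc)
    print('{:<10} -> {!r} (U+{:04X})'.format(enc, text, ord(text)))
```

All three codecs decode the byte without raising an error, but only one that maps it to ñ (U+00F1) reproduces the character the asker expects.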

answered Sep 22 '22 02:09 by Sir Markpo