
UnicodeDecodeError when reading CSV file in Pandas with Python

I'm running a program that processes 30,000 similar files. Some of them, seemingly at random, stop the run with this error:

    File "C:\Importer\src\dfman\importer.py", line 26, in import_chr
      data = pd.read_csv(filepath, names=fields)
    File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
      return _read(filepath_or_buffer, kwds)
    File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
      return parser.read()
    File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
      ret = self._engine.read(nrows)
    File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
      data = self._reader.read(nrows)
    File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
    File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
    File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
    File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
    File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
    File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
    File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
    File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 6: invalid continuation byte

These files all come from the same source. What's the best way to correct this so the import can proceed?

asked Aug 11 '13 12:08 by TravisVOX




2 Answers

read_csv takes an encoding option to deal with files in different encodings. I mostly use read_csv('file', encoding="ISO-8859-1"), or alternatively encoding="utf-8", for reading, and generally utf-8 for to_csv.

You can also use an alias such as 'latin' instead of 'ISO-8859-1', or try 'cp1252', which is not an alias but a closely related encoding common on Windows (see the Python docs, which also list the numerous other encodings you may encounter).
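To make the difference between these encoding names concrete, here is a small standard-library sketch (not part of the original answer). It decodes the byte from the question's error message, 0xda, followed by an ordinary letter, and then one byte on which cp1252 and ISO-8859-1 disagree. The sample byte strings are made up for illustration.

```python
# Hypothetical sample: the byte from the traceback (0xda) followed by "a".
sample = b"\xdaa"

try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # In UTF-8, 0xda opens a two-byte sequence, so the next byte must be a
    # continuation byte (0x80-0xbf). "a" is not one, hence the error.
    utf8_ok = False

# Both single-byte encodings map 0xda to the same character (U+00DA).
latin1_text = sample.decode("ISO-8859-1")
cp1252_text = sample.decode("cp1252")

# Bytes in the range 0x80-0x9f are where the two encodings differ.
quote_latin1 = b"\x93".decode("ISO-8859-1")  # a C1 control character
quote_cp1252 = b"\x93".decode("cp1252")      # a left curly quotation mark

# ascii() keeps the output printable on any console.
print(utf8_ok, ascii(latin1_text), ascii(cp1252_text))
print(ascii(quote_latin1), ascii(quote_cp1252))
```

If the files contain curly quotes or similar punctuation, cp1252 is probably the better first guess of the two; for plain accented letters, both give the same result.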

See the relevant pandas documentation, the Python docs examples on CSV files, and the many related questions here on SO. A good background resource is What every developer should know about Unicode and character sets.

To detect the encoding (assuming the file contains non-ASCII characters), you can use enca (see man page) or file -i (Linux) or file -I (OS X) (see man page).
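If those command-line tools are not available (for example on Windows, as in the question), a rough substitute can be sketched in Python with the standard library alone. The function name first_working_encoding and the candidate list are assumptions for illustration, not anything provided by pandas. Note that a successful decode does not prove the guess is right: ISO-8859-1 accepts every possible byte, so it is placed last as a catch-all.

```python
from pathlib import Path


def first_working_encoding(path, candidates=("utf-8", "cp1252", "ISO-8859-1")):
    """Return the first candidate that decodes the whole file, or None.

    Hypothetical helper: this is trial decoding, not real detection.
    """
    raw = Path(path).read_bytes()  # reads the whole file into memory
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes; try the next
        return name
    return None
```

When the result is not None, it could be passed on to the call from the question, for example pd.read_csv(filepath, names=fields, encoding=first_working_encoding(filepath)).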

answered Sep 19 '22 03:09 by Stefan


The simplest solution:

    import pandas as pd
    df = pd.read_csv('file_name.csv', engine='python')

Alternative solution:

  • Open the CSV file in Sublime Text or VS Code.
  • Save the file with UTF-8 encoding.

In Sublime Text, click File -> Save with Encoding -> UTF-8.
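With thousands of files, as in the question, re-saving each one by hand is impractical. The same conversion can be sketched in Python with the standard library. The function name resave_as_utf8 is hypothetical, and the default source encoding of cp1252 is only an assumption; it has to match what the files really contain, otherwise the text will be converted incorrectly.

```python
from pathlib import Path


def resave_as_utf8(src, dst, source_encoding="cp1252"):
    """Decode src with source_encoding and write the same text to dst as UTF-8.

    Hypothetical helper; source_encoding is an assumed default.
    Working on bytes avoids any newline translation along the way.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Writing to a separate destination path keeps the original file untouched in case the assumed source encoding turns out to be wrong.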

Then, you can read your file as usual:

    import pandas as pd
    data = pd.read_csv('file_name.csv', encoding='utf-8')

Other encodings you can try are:

    encoding = "cp1252"
    encoding = "ISO-8859-1"
answered Sep 20 '22 03:09 by Gil Baggio