Reading a CSV file into Pandas Dataframe with invalid characters (accents)

Tags: python, pandas, dataframe, csv, utf-8

I am trying to read a CSV file into a pandas DataFrame. However, the CSV contains accented characters. I am using Python 2.7.

I've run into a UnicodeDecodeError because there is an accent in the first column. I've read up on a bunch of sites, like this SO question about UTF-8 in CSV files, this blog post on CSV errors related to newlines, and this blog post on UTF-8 issues in Python 2.7.

I used the answers I found there to try to modify my code. Originally I had:

import pandas as pd

#Create a dataframe with the data we are interested in
df = pd.DataFrame.from_csv('MYDATA.csv')
mode = lambda ts: ts.value_counts(sort=True).index[0]
cols = df['CompanyName'].value_counts().index
df['Calls'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)

And so on. It worked, but now passing in "NÍ" and "Nê" as a customer name gives this error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 7: invalid continuation byte

I tried changing the line to df = pd.read_csv('MYDATA.csv', encoding='utf-8'), but this gives the same error.

So I tried the following, based on suggestions I found while researching, but it does not work either and I get the same error.

import pandas as pd
import csv

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]


# csv.excel is a dialect; csv.reader is a function and is not a valid dialect
reader = unicode_csv_reader(open('MYDATA.csv', 'rU'), dialect=csv.excel)
# Create a dataframe with the data we are interested in
df = pd.DataFrame(reader)

I feel like it should not be this difficult to read CSV data into a pandas DataFrame. Does anyone know of an easier way?

Edit: What is really strange is that if I delete the row with the accented characters, I still get an error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 960: invalid continuation byte

This is strange, as my test CSV has only 19 rows and 27 columns. But I hope that decoding the entire CSV as UTF-8 will fix the problem.

asked Jun 19 '15 19:06

jenryb

1 Answer

Try adding this to the top of your script:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
sys.setdefaultencoding('utf8')

answered Oct 12 '22 22:10

GNMO11
