Python pandas: load an ANSI-format CSV as UTF-8

I want to load a CSV file with pandas in a Jupyter notebook; the file contains characters like ä, ö, ü, ß.

When I open the CSV file with Notepad++ in ANSI format, here is one example row that causes trouble:

Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand

The correct UTF-8 outcome for Empf„nger should be: Empfänger

Now, when I load the CSV data with pandas in Python 3.6 on Windows using the following code:

df_a = pd.read_csv('file.csv',sep=';',encoding='utf-8')

I get an error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte

(Position 'xy' stands for the position at which the offending character occurs.)
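For context, a single high byte from a legacy single-byte codec can trigger exactly this error: in UTF-8, a byte like 0xDF or 0xE1 announces a multi-byte sequence, and if the byte that follows is not a valid continuation byte, the decode fails. A minimal sketch that reproduces the message (it assumes nothing about the real file; latin1 is just used to produce such a byte):

```python
# "ße" encoded as latin1 is b'\xdfe'; 0xdf starts a two-byte UTF-8
# sequence, but 0x65 ('e') is not a valid continuation byte.
data = "ße".encode("latin1")
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid continuation byte
```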

When I use the ANSI format to load my CSV file, it works, but the umlauts are displayed incorrectly.

Example code:

df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')

Empfänger is represented as: Empf„nger
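The mangling itself can be reproduced byte by byte: whatever single-byte codec "ANSI" resolves to (cp1252 on many Western Windows setups, which is an assumption here) maps the raw byte 0x84 to „ rather than ä, while the DOS code page cp850 maps the same byte to ä:

```python
# The raw byte 0x84 is ä in the DOS code page cp850,
# but „ (U+201E) in the Windows code page cp1252.
raw = b"\x84"
print(raw.decode("cp850"))   # ä
print(raw.decode("cp1252"))  # „
```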

Note: I have tried converting the file to UTF-8 in Notepad++ and loading it afterwards with pandas, but I still get the same error.

I have searched online for a solution, but the suggested fixes didn't work for me: "change the format in Notepad++ to UTF-8", using encoding='UTF-8', using 'latin1' (which gives me the same result as the ANSI format), and detecting the encoding with chardet:

import chardet

with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())

df_a = pd.read_csv('afile.csv',sep=';',encoding=result['encoding'])

encoding='cp1252'

throws the following exception:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>

I also tried to replace strings afterwards with the x.replace() method, but the character ü disappears completely after being loaded into a pandas DataFrame.

asked May 04 '17 by MBUser


2 Answers

If you don't know your file's encoding, I think the fastest approach is to open the file in a text editor such as Notepad++ and check how it is encoded.

Then go to the Python documentation and look up the correct codec to use.

In your case, ANSI, the codec is 'mbcs', so your code will look like this:

df_a = pd.read_csv('file.csv',sep=';',encoding='mbcs')
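One caveat worth adding: the 'mbcs' codec is registered only on Windows builds of Python, where it resolves to the current ANSI code page. A hedged sketch that falls back to naming a code page explicitly elsewhere (cp1252 is only a common Western-Windows default, not a guarantee):

```python
import codecs

# 'mbcs' exists only on Windows; elsewhere codecs.lookup() raises
# LookupError, so fall back to an explicit code page name.
try:
    codecs.lookup("mbcs")
    encoding = "mbcs"
except LookupError:
    encoding = "cp1252"  # a common Windows default; an assumption

print(encoding)
```

The resulting name can then be passed straight to pd.read_csv(..., encoding=encoding).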
answered Sep 23 '22 by rflmorais


When EmpfängerStraße shows up as Empf„ngerStraáe when decoded as "ANSI" (more precisely, cp1250 in this case), then the actual encoding of the data is most likely cp850:

print 'Empf„ngerStraáe'.decode('utf8').encode('cp1250').decode('cp850')

Or Python 3, where literal strings are already unicode strings:

print("Empf„ngerStraáe".encode("cp1250").decode("cp850"))
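If cp850 is indeed the culprit, the fix on the pandas side is simply to name that codec. A sketch using an in-memory stand-in for the real file (the bytes below are an assumption reconstructed from the question's header row, since the actual afile.csv is not available):

```python
import io
import pandas as pd

# The question's header row, encoded as cp850 bytes to stand in for afile.csv.
raw = "Empfänger;EmpfängerStadt;EmpfängerStraße\n".encode("cp850")
df_a = pd.read_csv(io.BytesIO(raw), sep=";", encoding="cp850")
print(list(df_a.columns))  # ['Empfänger', 'EmpfängerStadt', 'EmpfängerStraße']
```

With the real file this would just be pd.read_csv('afile.csv', sep=';', encoding='cp850').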
answered Sep 21 '22 by BlackJack