
Python - Reading and writing csv files with utf-8 encoding

I'm trying to read a csv file whose header contains foreign characters, and I'm having a lot of problems with it.

First of all, I'm reading the file with a simple csv.reader:

filename = 'C:\\Users\\yuval\\Desktop\\בית ספר\\עבודג\\new\\resources\\mk'+ str(mkNum) + 'Data.csv'
raw_data = open(filename, 'rt', encoding="utf8")
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)
header = x[0]
data = np.array(x[1:]).astype('float')

The variable header should be a list containing the file's column headers, but what I actually get is:

['\ufeff"dayPart"', '"length"', '"ifPhoto"', '"ifVideo"', '"ifAlbum"', '"לא"', '"הוא"', '"בכל"', '"אותם"', '"זה"', '"הם"', '"כדי"', '"את"', '"יש"', '"לי"', '"היא"', '"אני"', '"רק"', '"להם"', '"על"', '"עם"', '"של"', '"המדינה"', '"כל"', '"גם"', '"הזה"', '"אם"', '"ישראל"', '"לכל"', '"מי"', '"ל"', '"אמסלם"', '"לנו"', '"אבל"', '"זו"', '"אין"', '"שבת"', '"שלום"', '"כ"', '"שלנו"', '"היום"', '"ומבורך"', '"ח"', '"דודי"', '"ר"', '"הפנים"', '"מה"', '"כי"', '"ה"', '"אחד"', '"ולא"', '"יותר"']

I don't know why it adds \ufeff to the first element, or why every field is wrapped in double quotation marks.

After that, I need to write another csv file whose header also contains foreign characters. I tried the following, but the characters came out as weird symbols:

with open('C:\\Users\\yuval\\Desktop\\בית ספר\\עבודג\\new\\variance reduction 1\\mk'+ str(mkNum) + 'Data.csv', 'w', newline='', encoding='utf8') as csvFile:
    csvWriter = csv.writer(csvFile, delimiter=',')
    csvWriter.writerow(newHeader)

Does anyone know how to fix this and work with utf8 encoding in the csv file's header?

Yuval Buium asked Jan 03 '18 21:01



1 Answer

You report three separate problems. This is something of a shot in the dark, because there isn't enough information to be sure, but you should try the following:

  1. input encoding: As suggested in the comments, try opening the file with encoding="utf-8-sig". The '\ufeff' at the start of your first header field is a Byte Order Mark (BOM), and this codec removes it from your input.

  2. double quotes: Among the csv parameters, you specify quoting=csv.QUOTE_NONE. This tells the csv library that the CSV table was written without using quotes (for escaping characters that could otherwise be mistaken for field or row separators), so the reader treats any quote characters as ordinary data. However, this is apparently not true of your file, since the input has quotes around each field. Try csv.QUOTE_MINIMAL (the default) or csv.QUOTE_ALL instead.

  3. output encoding: You say the output contains "weird symbols". I suspect that the output is actually all right, but that you are viewing it with a tool which doesn't properly display UTF-8 text by default: many Windows applications (such as Excel) still prefer UTF-16 and localised 8-bit encodings like Windows-1255 (cp1255). As with problem 1, you should try the codec "utf-8-sig": on output it writes a BOM at the start of the file, which many viewers/editors understand as an encoding hint.
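Putting the three suggestions together, here is a minimal sketch of what the reading and writing code could look like. It is untested against your actual files: the helper names read_table and write_table, the file names in.csv and out.csv, and the sample contents are all made up for illustration, and I use plain lists of floats instead of numpy to keep it self-contained.

```python
import csv
import os
import tempfile


def read_table(path):
    """Read a CSV whose first row is a header; return (header, rows of floats)."""
    # "utf-8-sig" drops a leading BOM if present, otherwise it behaves like UTF-8.
    with open(path, "r", newline="", encoding="utf-8-sig") as f:
        # Default quoting (csv.QUOTE_MINIMAL) strips the quotes around fields.
        rows = list(csv.reader(f, delimiter=","))
    header = rows[0]
    data = [[float(cell) for cell in row] for row in rows[1:]]
    return header, data


def write_table(path, header, data):
    """Write header and rows as UTF-8 with a BOM as an encoding hint for viewers."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f, delimiter=",")
        writer.writerow(header)
        writer.writerows(data)


# Hypothetical sample file mimicking the one in the question:
# a BOM at the start and every header field wrapped in double quotes.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "in.csv")
dst = os.path.join(tmp_dir, "out.csv")
with open(src, "w", newline="", encoding="utf-8-sig") as f:
    f.write('"dayPart","length","לא"\r\n1,2.5,0\r\n')

header, data = read_table(src)
print(header)  # header fields without the BOM and without the quotes
print(data)

write_table(dst, header, data)
with open(dst, "rb") as f:
    print(f.read()[:3])  # the first three bytes are the UTF-8 BOM
```

If the output still looks wrong after this, check how the viewer is opening the file before changing the Python code again; reading the file back with read_table is a quick way to confirm that the bytes on disk are fine.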

lenz answered Oct 11 '22 14:10
