I have a question about reading and writing a CSV file in Python 2.7 with the 'utf-8-sig' encoding. My CSV header is
['\xef\xbb\xbfID;timestamp;CustomerID;Email']
It starts with some extra bytes ("\xef\xbb\xbf" in front of "ID") that I read from file A.csv, and I want to write the same bytes and header to file B.csv.
My print log shows:
['\xef\xbb\xbfID;timestamp;CustomerID;Email']
But the header in the actual output file looks like this:
ÔªøID;timestamp
Here is the code:
def remove_gdpr_info_from_csv(file_path, file_name, temp_folder, original_header):
    new_temp_folder = tempfile.mkdtemp()
    new_temp_file = new_temp_folder + "/" + file_name
    # Blanked new file
    with open(new_temp_file, 'wb') as outfile:
        writer = csv.writer(outfile, delimiter=";")
        print original_header
        writer.writerow(original_header)
        # File from SFTP
        with open(file_path, 'r') as infile:
            reader = csv.reader(infile, delimiter=";")
            first_row = next(reader)
            email = first_row.index('Email')
            contract_detractor1 = first_row.index('Contact Detractor (Q21)')
            contract_detractor2 = first_row.index('Contact Detractor (Q20)')
            contract_detractor3 = first_row.index('Contact Detractor (Q43)')
            contract_detractor4 = first_row.index('Contact Detractor(Q26)')
            contract_detractor5 = first_row.index('Contact Detractor(Q27)')
            contract_detractor6 = first_row.index('Contact Detractor(Q44)')
            indexes = []
            for column_name in header_list:
                ind = first_row.index(column_name)
                indexes.append(ind)
            for row in reader:
                output_row = []
                for ind in indexes:
                    data = row[ind]
                    if ind == email:
                        data = ''
                    elif ind == contract_detractor1:
                        data = ''
                    elif ind == contract_detractor2:
                        data = ''
                    elif ind == contract_detractor3:
                        data = ''
                    elif ind == contract_detractor4:
                        data = ''
                    elif ind == contract_detractor5:
                        data = ''
                    elif ind == contract_detractor6:
                        data = ''
                    output_row.append(data)
                writer.writerow(output_row)
    s3core.upload_files(SPARKY_S3, DESTINATION_PATH, new_temp_file)
    shutil.rmtree(temp_folder)
    shutil.rmtree(new_temp_folder)
'\xef\xbb\xbf' is the UTF-8 encoded form of the Unicode character ZERO WIDTH NO-BREAK SPACE (U+FEFF). It is often used as a Byte Order Mark (BOM) at the beginning of Unicode text files:
if the file starts with '\xef\xbb\xbf', then the file is UTF-8 encoded
if the file starts with '\xff\xfe', then the file is in UTF-16 little endian
if the file starts with '\xfe\xff', then the file is in UTF-16 big endian
The 'utf-8-sig' encoding explicitly asks for this BOM to be written at the beginning of the file.
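As a rough illustration of the BOM rules above, the sketch below checks the first bytes of some data against the BOM constants in the codecs module. The helper name sniff_bom is hypothetical, and it deliberately ignores UTF-32. The last line also suggests where the odd characters in the question's output file may come from: the three BOM bytes, when a viewer decodes them as Mac Roman instead of UTF-8, display as Ôªø. That would mean the file itself is fine and only the viewer's encoding is wrong, but that is an assumption about the viewer.

```python
import codecs

def sniff_bom(head):
    """Hypothetical helper: return the encoding implied by a leading BOM, or None."""
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if head.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'
    if head.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    return None

# The header bytes from the question.
header = b'\xef\xbb\xbfID;timestamp;CustomerID;Email'

encoding = sniff_bom(header)               # 'utf-8-sig'
clean = header.decode('utf-8-sig')         # decoded text with the BOM stripped
# The same three BOM bytes as shown by a viewer that assumes Mac Roman.
mojibake = header[:3].decode('mac_roman')
```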
To strip it automatically when reading a CSV file in Python 2, you can use the codecs module:
import codecs
with open(file_path, 'r') as infile:
    reader = csv.reader(codecs.EncodedFile(infile, 'utf-8', 'utf-8-sig'), delimiter=";")
EncodedFile wraps the original file object: it decodes the content as utf-8-sig, which skips the BOM, and re-encodes it as UTF-8 with no BOM.
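To see the effect without touching a real file, here is a small sketch that uses an in-memory io.BytesIO object as a stand-in for A.csv. The sample row is made up for illustration. It only reads the raw bytes back through the wrapper and does not involve the csv module.

```python
import codecs
import io

# In-memory stand-in for A.csv: a UTF-8 BOM followed by a header and one made-up row.
raw = io.BytesIO(codecs.BOM_UTF8 + b'ID;timestamp\n1;2019-01-01\n')

# Second argument: the encoding the caller reads (plain UTF-8).
# Third argument: the encoding of the underlying bytes (UTF-8 with BOM).
wrapped = codecs.EncodedFile(raw, 'utf-8', 'utf-8-sig')

data = wrapped.read()                  # UTF-8 bytes, BOM removed
first_line = data.split(b'\n')[0]      # the header line without '\xef\xbb\xbf'
```

With the BOM gone, a lookup such as first_row.index('ID') on the parsed header should no longer fail because of the hidden prefix.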
You want to use the EncodedFile function from the codecs module, as in Serge Ballesta's answer.
However, if Python 2.7 does not accept utf-8-sig as a name for the UTF-8-sig encoding in your setup, use the codec's module name utf_8_sig instead. Additionally, the arguments must give the output data encoding first and the file encoding second: codecs.EncodedFile(file, data_encoding, file_encoding=None, errors='strict')
Here's the full result:
import codecs
with open(file_path, 'r') as infile:
    reader = csv.reader(codecs.EncodedFile(infile, 'utf8', 'utf_8_sig'), delimiter=";")
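For the writing half of the question, the same wrapper can be used in the other direction. This is only a sketch under assumptions: io.BytesIO stands in for B.csv, and the rows are written as raw bytes rather than through csv.writer. The utf-8-sig writer emits the BOM once, before the first chunk of data, so the header passed in should not already contain '\xef\xbb\xbf' or the BOM would appear twice.

```python
import codecs
import io

# In-memory stand-in for B.csv (opened in binary mode in real code).
raw_out = io.BytesIO()

# The caller writes plain UTF-8; the wrapper stores it as UTF-8 with a leading BOM.
wrapped_out = codecs.EncodedFile(raw_out, 'utf-8', 'utf-8-sig')

wrapped_out.write(b'ID;timestamp\r\n')      # BOM is emitted before this first write
wrapped_out.write(b'1;2019-01-01\r\n')      # later writes get no extra BOM

written = raw_out.getvalue()
```

In Python 2 it should be possible to hand wrapped_out to csv.writer in place of the plain file object, since csv.writer only calls write() with byte strings, but I have not verified that against the code in the question.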