I'm using the following code to unzip and save a CSV file:
    with gzip.open(filename_gz) as f:
        file = open(filename, "w")
        output = csv.writer(file, delimiter=',')
        output.writerows(csv.reader(f, dialect='excel', delimiter=';'))
Everything seems to work, except that the first characters in the output file are unexpected. Googling around suggests that this is due to a BOM (byte order mark) in the file.
I've read that encoding the content in utf-8-sig should fix the issue. However, adding:
    .read().encode('utf-8-sig')
to f in csv.reader fails with:
      File "ckan_gz_datastore.py", line 16, in <module>
        output.writerows(csv.reader(f.read().encode('utf-8-sig'), dialect='excel', delimiter = ';'))
      File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 15, in encode
        return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
How can I remove the BOM and just save the content in correct utf-8?
First, you need to decode the file contents, not encode them. Second, the csv module doesn't like unicode strings in Python 2.7, so having decoded your data you need to convert back to utf-8. Finally, csv.reader is passed an iteration over the lines of the file, not a big string with linebreaks in it.
So:
    csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines())
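To see what each step in that chain does, here is a minimal sketch on made-up in-memory data instead of the real gzip file. The sample bytes and variable names are illustrative assumptions. The last step assumes Python 3, where csv accepts text directly, so the re-encoding back to UTF-8 is only needed on Python 2.7.

```python
import codecs
import csv

# Made-up sample: a UTF-8 BOM followed by two semicolon-delimited lines.
data = codecs.BOM_UTF8 + b"name;value\nalpha;1\n"

# Plain 'utf-8' keeps the BOM as the character U+FEFF; 'utf-8-sig' drops it.
kept = data.decode("utf-8")
stripped = data.decode("utf-8-sig")

# The chain from the answer: decode, re-encode, split into lines (byte strings).
byte_lines = data.decode("utf-8-sig").encode("utf-8").splitlines()

# Assumption: Python 3, where csv.reader wants text, so the decoded
# lines are passed in directly without re-encoding.
rows = list(csv.reader(stripped.splitlines(), dialect="excel", delimiter=";"))
```

Note that splitlines() drops the line endings, which csv.reader does not need when it is handed one record per item.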
However, you might consider it simpler / more efficient just to remove the BOM manually:
    import codecs

    def remove_bom(line):
        return line[3:] if line.startswith(codecs.BOM_UTF8) else line

    csv.reader((remove_bom(line) for line in f), dialect='excel', delimiter=';')
That is subtly different, since it removes a BOM from any line that starts with one, instead of just the first line. If you don't need to keep other BOMs that's OK, otherwise you can fix it with:
    def remove_bom_from_first(iterable):
        f = iter(iterable)
        firstline = next(f, None)
        if firstline is not None:
            yield remove_bom(firstline)
            # Pass the remaining lines through unchanged.
            for line in f:
                yield line
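The difference between the two approaches can be checked on a couple of made-up byte-string lines. This is only a sketch: the sample data is invented, and both helpers are restated so the snippet runs on its own (the slice uses len(codecs.BOM_UTF8), which is 3, instead of the literal).

```python
import codecs

def remove_bom(line):
    # Strip a UTF-8 BOM from the start of a byte string, if present.
    return line[len(codecs.BOM_UTF8):] if line.startswith(codecs.BOM_UTF8) else line

def remove_bom_from_first(iterable):
    # Strip a BOM from the first item only; later items pass through unchanged.
    it = iter(iterable)
    first = next(it, None)
    if first is not None:
        yield remove_bom(first)
        for line in it:
            yield line

# Hypothetical input in which the second line also starts with a BOM.
lines = [codecs.BOM_UTF8 + b"a;b\n", codecs.BOM_UTF8 + b"c;d\n"]

every_line = [remove_bom(line) for line in lines]   # BOM removed from both lines
first_only = list(remove_bom_from_first(lines))     # BOM kept on the second line
```

An empty input is also handled: next(it, None) returns None, so the generator simply yields nothing.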