Removing BOM from gzip'ed CSV in Python

I'm using the following code to unzip and save a CSV file:

with gzip.open(filename_gz) as f:
    file = open(filename, "w");
    output = csv.writer(file, delimiter = ',')
    output.writerows(csv.reader(f, dialect='excel', delimiter = ';'))

Everything seems to work, except that the first characters in the output file are unexpected. Googling suggests this is caused by a BOM (byte order mark) at the start of the file.

I've read that encoding the content in utf-8-sig should fix the issue. However, adding:

.read().encode('utf-8-sig')

to f in csv.reader fails with:

File "ckan_gz_datastore.py", line 16, in <module>
    output.writerows(csv.reader(f.read().encode('utf-8-sig'), dialect='excel', delimiter = ';'))
File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 15, in encode
    return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

How can I remove the BOM and just save the content in correct utf-8?

user809829 asked Jan 03 '14 08:01



1 Answer

First, you need to decode the file contents, not encode them.

Second, the csv module in Python 2.7 does not support unicode strings, so after decoding your data you need to encode it back to UTF-8 byte strings.

Finally, csv.reader expects an iterable over the lines of the file, not one big string with line breaks in it. (Iterating over a string yields single characters, not lines.)

So:

csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines())
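
For comparison, on Python 3 (an assumption; the question targets Python 2.7) the decode step can be pushed into the file object itself, because gzip.open accepts a text mode and an encoding there. The following is a minimal sketch rather than a drop-in replacement: the function name gz_csv_to_utf8 is made up for illustration, while filename_gz and filename are the names used in the question.

```python
import csv
import gzip


def gz_csv_to_utf8(filename_gz, filename):
    """Convert a gzip'ed ';'-delimited CSV to a plain ','-delimited UTF-8 CSV.

    Hypothetical helper for illustration; assumes Python 3.
    """
    # "rt" opens the gzip stream in text mode; the "utf-8-sig" codec
    # skips a BOM at the very start of the stream if one is present.
    # newline="" lets the csv module handle line endings itself.
    with gzip.open(filename_gz, "rt", encoding="utf-8-sig", newline="") as src, \
            open(filename, "w", encoding="utf-8", newline="") as dst:
        reader = csv.reader(src, dialect="excel", delimiter=";")
        writer = csv.writer(dst, delimiter=",")
        writer.writerows(reader)
```

Since the codec only skips a BOM at the start of the stream, this should behave like the decode-then-encode one-liner, but without reading the whole file into memory first.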

However, you might consider it simpler and more efficient just to remove the BOM manually:

import codecs

def remove_bom(line):
    # codecs.BOM_UTF8 is the 3-byte sequence '\xef\xbb\xbf'
    return line[3:] if line.startswith(codecs.BOM_UTF8) else line

csv.reader((remove_bom(line) for line in f), dialect='excel', delimiter=';')

That is subtly different, since it removes a BOM from any line that starts with one, instead of just the first line. If you don't need to keep other BOMs that's OK, otherwise you can fix it with:

def remove_bom_from_first(iterable):
    f = iter(iterable)
    firstline = next(f, None)
    if firstline is not None:
        # Only the first line is stripped; later lines pass through unchanged
        yield remove_bom(firstline)
        for line in f:
            yield line
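
To see the difference between the two variants, here is a small self-contained sketch on made-up sample data (the sample list and the names every_line and first_only are illustrative, not from the question). It restates both helpers so that it runs on its own, and it works on byte strings, which is what gzip.open yields in its default binary mode.

```python
import codecs


def remove_bom(line):
    # Strip a leading UTF-8 BOM from a byte string, if present.
    if line.startswith(codecs.BOM_UTF8):
        return line[len(codecs.BOM_UTF8):]
    return line


def remove_bom_from_first(iterable):
    # Strip the BOM from the first line only; yield later lines unchanged.
    it = iter(iterable)
    first = next(it, None)
    if first is not None:
        yield remove_bom(first)
        for line in it:
            yield line


# Made-up sample: a BOM on the first line and another one on the third.
sample = [codecs.BOM_UTF8 + b"a;b\n", b"1;2\n", codecs.BOM_UTF8 + b"3;4\n"]

every_line = [remove_bom(line) for line in sample]   # BOM removed from lines 1 and 3
first_only = list(remove_bom_from_first(sample))     # BOM removed from line 1 only
```

With this sample, every_line has no BOM left anywhere, while first_only still carries the BOM on its third element.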
Steve Jessop answered Sep 28 '22 03:09
