Removing BOM from gzip'ed CSV in Python

Tags: python, csv, byte-order-mark

I'm using the following code to unzip and save a CSV file:

with gzip.open(filename_gz) as f:
    file = open(filename, "w")
    output = csv.writer(file, delimiter = ',')
    output.writerows(csv.reader(f, dialect='excel', delimiter = ';'))

Everything seems to work, except that the first characters in the output file are unexpected. Googling around suggests this is caused by a BOM (byte order mark) at the start of the file.

I've read that encoding the content in utf-8-sig should fix the issue. However, adding:

.read().encode('utf-8-sig')

to f in csv.reader fails with:

File "ckan_gz_datastore.py", line 16, in <module>
    output.writerows(csv.reader(f.read().encode('utf-8-sig'), dialect='excel', delimiter = ';'))
File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 15, in encode
    return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

How can I remove the BOM and just save the content in correct utf-8?


asked Jan 03 '14 08:01

user809829

1 Answer

First, you need to decode the file contents, not encode them.

Second, the csv module doesn't like unicode strings in Python 2.7, so having decoded your data you need to convert back to utf-8.

Finally, csv.reader expects an iterable over the lines of the file, not one big string with linebreaks in it.

So:

csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines())
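To make the three steps concrete, here is a minimal in-memory sketch. It assumes Python 3 rather than the 2.7 of the question, and the sample bytes stand in for what f.read() would return; they are invented for illustration. On Python 3 csv.reader works on str, so the re-encode step from the one-liner above is not needed there.

```python
import codecs
import csv

# Made-up sample: the bytes f.read() might return for a BOM-prefixed file.
raw = codecs.BOM_UTF8 + b"id;name\n1;Ada\n"

# Decoding as plain utf-8 keeps the BOM as the character U+FEFF.
with_bom = raw.decode("utf-8")

# Decoding as utf-8-sig drops a leading BOM if one is present.
text = raw.decode("utf-8-sig")

# csv.reader wants an iterable of lines, hence splitlines().
# (On Python 2.7 you would call .encode('utf-8') before .splitlines().)
rows = list(csv.reader(text.splitlines(), dialect="excel", delimiter=";"))
print(rows)
```

With plain utf-8 the first field would come out as "\ufeffid", which is the stray leading character described in the question.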

However, you might consider it simpler / more efficient just to remove the BOM manually:

import codecs

def remove_bom(line):
    # codecs.BOM_UTF8 is the three bytes EF BB BF, hence the [3:] slice
    return line[3:] if line.startswith(codecs.BOM_UTF8) else line

csv.reader((remove_bom(line) for line in f), dialect = 'excel', delimiter = ';')

That is subtly different, since it removes a BOM from any line that starts with one, instead of just the first line. If you don't need to keep other BOMs that's OK, otherwise you can fix it with:

def remove_bom_from_first(iterable):
    f = iter(iterable)
    firstline = next(f, None)
    if firstline is not None:
        yield remove_bom(firstline)
        for line in f:
            yield line
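For completeness, here is a rough end-to-end sketch of the first-line-only approach wired into the gzip-to-CSV conversion from the question. It assumes Python 3, where gzip.open in "rb" mode yields byte lines and csv.reader needs str, so each line is decoded after the BOM check. The convert name is hypothetical and not part of the original code.

```python
import codecs
import csv
import gzip

def remove_bom(line):
    # Drop a leading UTF-8 BOM from a byte string, if there is one.
    if line.startswith(codecs.BOM_UTF8):
        return line[len(codecs.BOM_UTF8):]
    return line

def remove_bom_from_first(iterable):
    # Strip the BOM from the first line only; later lines pass through as-is.
    lines = iter(iterable)
    firstline = next(lines, None)
    if firstline is not None:
        yield remove_bom(firstline)
        for line in lines:
            yield line

def convert(filename_gz, filename):
    # Hypothetical helper: semicolon-separated gzip in, comma-separated UTF-8 out.
    with gzip.open(filename_gz, "rb") as f, \
         open(filename, "w", encoding="utf-8", newline="") as out:
        # Decode after stripping, because Python 3's csv.reader expects str.
        text_lines = (line.decode("utf-8") for line in remove_bom_from_first(f))
        reader = csv.reader(text_lines, dialect="excel", delimiter=";")
        csv.writer(out, delimiter=",").writerows(reader)
```

Called as convert(filename_gz, filename), this leaves any BOM that appears at the start of a later line in place, which is the difference from the per-line version described above.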

answered Sep 28 '22 03:09

Steve Jessop
