Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas to_csv to GzipFile in Python 3 not working

Tags:

python

pandas

Saving a Pandas dataframe to gzipped csv in memory works like this in Python 2.7 (Pandas 0.22.0):

from io import BytesIO
import gzip
import pandas as pd
df = pd.DataFrame.from_dict({'a': ['a', 'b', 'c']})
s = BytesIO()
f = gzip.GzipFile(fileobj=s, mode='wb', filename='file.csv')
df.to_csv(f)
s.seek(0)
content = s.getvalue()

However, in Python 3.6 (Pandas 0.22.0), the same code throws error when calling to_csv:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lib/python3.6/site-packages/pandas/core/frame.py", line 1524, in to_csv
    formatter.save()
  File "lib/python3.6/site-packages/pandas/io/formats/format.py", line 1652, in save
    self._save()
  File "lib/python3.6/site-packages/pandas/io/formats/format.py", line 1740, in _save
    self._save_header()
  File "lib/python3.6/site-packages/pandas/io/formats/format.py", line 1708, in _save_header
    writer.writerow(encoded_labels)
  File "miniconda3/lib/python3.6/gzip.py", line 260, in write
    data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'

How should I resolve this? Do I need to alter the GzipFile object somehow for to_csv to handle it properly?

To clarify, I want to create the gzipped file in-memory (the content variable) so that I can save it to Amazon S3 using Boto 3 put_object later.

like image 208
Waiski Avatar asked Apr 26 '18 09:04

Waiski


People also ask

Does pandas to_csv overwrite?

Does pandas To_csv overwrite? If the file already exists, it will be overwritten. If no path is given, then the Frame will be serialized into a string, and that string will be returned.

What does to_csv do in pandas?

Pandas DataFrame to_csv() function converts DataFrame into CSV data. We can pass a file object to write the CSV data into a file. Otherwise, the CSV data is returned in the string format.


1 Answers

You can utilise StringIO:

from io import StringIO
buf = StringIO()
df.to_csv(buf)
f = gzip.GzipFile(fileobj=s, mode='wb', filename='file.csv')
f.write(buf.getvalue().encode())
f.flush()

Note also the added f.flush() - according to my experience without this line the GzipFile may in some cases randomly not flush the data, resulting in corrupt archive.

Or as a complete example based on your code:

from io import BytesIO
import gzip
import pandas as pd
from io import StringIO
df = pd.DataFrame.from_dict({'a': ['a', 'b', 'c']})
s = BytesIO()
buf = StringIO()
f = gzip.GzipFile(fileobj=s, mode='wb', filename='file.csv')
df.to_csv(buf)
f.write(buf.getvalue().encode())
f.flush()
s.seek(0)
content = s.getvalue()
like image 166
Roland Pihlakas Avatar answered Oct 12 '22 23:10

Roland Pihlakas