Saving a pandas DataFrame to a gzipped CSV in memory works like this in Python 2.7 (Pandas 0.22.0):
from io import BytesIO
import gzip
import pandas as pd
df = pd.DataFrame.from_dict({'a': ['a', 'b', 'c']})
s = BytesIO()
f = gzip.GzipFile(fileobj=s, mode='wb', filename='file.csv')
df.to_csv(f)
s.seek(0)
content = s.getvalue()
However, in Python 3.6 (Pandas 0.22.0), the same code throws an error when calling to_csv:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lib/python3.6/site-packages/pandas/core/frame.py", line 1524, in to_csv
    formatter.save()
  File "lib/python3.6/site-packages/pandas/io/formats/format.py", line 1652, in save
    self._save()
  File "lib/python3.6/site-packages/pandas/io/formats/format.py", line 1740, in _save
    self._save_header()
  File "lib/python3.6/site-packages/pandas/io/formats/format.py", line 1708, in _save_header
    writer.writerow(encoded_labels)
  File "miniconda3/lib/python3.6/gzip.py", line 260, in write
    data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'
How should I resolve this? Do I need to alter the GzipFile object somehow for to_csv to handle it properly?
To clarify, I want to create the gzipped file in memory (the content variable) so that I can save it to Amazon S3 using Boto 3 put_object later.
For reference, DataFrame.to_csv() converts a DataFrame into CSV data. If you pass a path or file object, the CSV data is written there, and an existing file at that path is overwritten. If no path is given, the CSV data is returned as a string.
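In Python 3, to_csv writes text (str), while a GzipFile opened in binary mode only accepts bytes, hence the TypeError. You can write the CSV to a StringIO first and then encode it: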
from io import StringIO
buf = StringIO()
df.to_csv(buf)
f = gzip.GzipFile(fileobj=s, mode='wb', filename='file.csv')
f.write(buf.getvalue().encode())
f.close()
Note also the added f.close(): GzipFile only writes the gzip trailer (CRC and size) when it is closed, so reading the buffer after a flush() alone gives a truncated, corrupt archive. Closing the GzipFile does not close the underlying BytesIO.
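As a sanity check on the flushing behaviour, here is a small standard-library sketch that compares a buffer read after flush() with one read after close(). The helper names gzip_in_memory and is_complete_gzip are made up for illustration, and the CSV bytes are a stand-in for real to_csv output. As far as I can tell, only close() writes the gzip trailer, so the flushed buffer does not decompress cleanly.

```python
import gzip
from io import BytesIO


def gzip_in_memory(data, finish):
    """Compress `data` into a BytesIO and return its bytes.

    `finish` names the GzipFile method called before reading the
    buffer: either 'flush' or 'close'.
    """
    s = BytesIO()
    f = gzip.GzipFile(fileobj=s, mode='wb', filename='file.csv')
    f.write(data)
    getattr(f, finish)()
    return s.getvalue()


def is_complete_gzip(blob):
    """Return True if `blob` decompresses as a complete gzip stream."""
    try:
        gzip.decompress(blob)
        return True
    except (EOFError, OSError):
        return False


# Stand-in for the encoded CSV text produced by to_csv.
csv_bytes = b",a\n0,a\n1,b\n2,c\n"

flushed = gzip_in_memory(csv_bytes, "flush")
closed = gzip_in_memory(csv_bytes, "close")

print("after flush():", is_complete_gzip(flushed))
print("after close():", is_complete_gzip(closed))
```

Running a check like this before uploading is a cheap way to catch a truncated archive.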
Or as a complete example based on your code:
import gzip
from io import BytesIO, StringIO

import pandas as pd

df = pd.DataFrame.from_dict({'a': ['a', 'b', 'c']})

# Let pandas write the CSV as text first.
buf = StringIO()
df.to_csv(buf)

# Compress the encoded text into an in-memory bytes buffer.
s = BytesIO()
f = gzip.GzipFile(fileobj=s, mode='wb', filename='file.csv')
f.write(buf.getvalue().encode())
f.close()  # writes the gzip trailer; flush() alone leaves the stream incomplete

content = s.getvalue()
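If the filename field in the gzip header is not needed, a shorter route might be gzip.compress (available since Python 3.2), because to_csv() returns the CSV as a string when called without a path. This is only a minimal sketch: csv_text_to_gzip_bytes is a hypothetical helper name, and the sample text stands in for df.to_csv() so the snippet runs without pandas.

```python
import gzip


def csv_text_to_gzip_bytes(csv_text, encoding="utf-8"):
    """Compress CSV text into a complete in-memory gzip stream."""
    return gzip.compress(csv_text.encode(encoding))


# Stand-in for df.to_csv(); with pandas installed you would write:
#     content = csv_text_to_gzip_bytes(df.to_csv())
sample_csv = ",a\n0,a\n1,b\n2,c\n"
content = csv_text_to_gzip_bytes(sample_csv)

# Round-trip check: decompressing gives back the original text.
restored = gzip.decompress(content).decode("utf-8")
print(restored == sample_csv)
```

The resulting bytes can then be passed as the Body argument of Boto 3 put_object.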