Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python pandas to_csv zip format

I am having a peculiar problem when writing zip files through to_csv.

Using GZIP:

df.to_csv(path_or_buf = 'sample.csv.gz', compression="gzip", index = None, sep = ",", header=True, encoding='utf-8-sig')

gives a neat gzip file with name 'sample.csv.gz' and inside it I get my csv 'sample.csv'

However, things change when using ZIP

df.to_csv(path_or_buf = 'sample.csv.zip', compression="zip", index = None, sep = ",", header=True, encoding='utf-8-sig')

gives a zip file with name 'sample.csv.zip', but inside it the csv has been renamed to 'sample.csv.zip' as well. Removing the extra '.zip' from the file gives the csv back.

How can I implement zip extension without this issue? I need to have zip files as a requirement that I can't bypass. I am using python 2.7 on windows 10 machine.

Thanks in advance for help.

like image 653
Kshitiz Avatar asked Mar 13 '19 05:03

Kshitiz


People also ask

Can Pandas read zip files?

Yes you can. If you want to read a zipped or a tar. gz file into pandas dataframe, the read_csv methods includes this particular implementation.

How do I zip a Pandas DataFrame?

One of the way to create Pandas DataFrame is by using zip() function. You can use the lists to create lists of tuples and create a dictionary from it. Then, this dictionary can be used to construct a dataframe. zip() function creates the objects and that can be used to produce single item at a time.

What does to_csv do in Pandas?

Pandas DataFrame to_csv() function converts DataFrame into CSV data. We can pass a file object to write the CSV data into a file. Otherwise, the CSV data is returned in the string format.

What does to_csv mean in Python?

The to_csv() function is used to write object to a comma-separated values (csv) file.


Video Answer


3 Answers

It is pretty straightforward in pandas since version 1.0.0 using dict as compression options:

filename = 'sample'
compression_options = dict(method='zip', archive_name=f'{filename}.csv')
df.to_csv(f'{filename}.zip', compression=compression_options, ...)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

like image 178
Pero Avatar answered Oct 20 '22 07:10

Pero


As the thread linked in the comment discusses, ZIP's directory-like nature makes it hard to do what you want without making a lot of assumptions or complicating the arguments for to_csv

If your goal is to write the data directly to a ZIP file, that's harder than you'd think.

If you can bear temporarily writing your data to the filesystem, you can use Python's zipfile module to put that file in a ZIP with the name you preferred, and then delete the file.


import zipfile
import os

df.to_csv('sample.csv',index=None,sep=",",header=True,encoding='utf-8-sig')

with zipfile.ZipFile('sample.zip', 'w') as zf:
    zf.write('sample.csv')
os.remove('sample.csv')

like image 26
Joe Germuska Avatar answered Oct 20 '22 08:10

Joe Germuska


Since Pandas 1.0.0 it's possible to set compression using to_csv().

Example in one line:

df.to_csv('sample.zip', compression={'method': 'zip', 'archive_name': 'sample.csv'})
like image 22
washolive Avatar answered Oct 20 '22 07:10

washolive