 

Pandas to_csv() slow saving large dataframe


I'm guessing this is an easy fix, but I'm running into an issue where it takes nearly an hour to save a pandas DataFrame to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.

    import os
    import glob
    import pandas as pd

    src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

    # 1 - Takes 2 min to read 20m records from 30 files
    for file_ in sorted(src_files):
        stage = pd.DataFrame()
        iter_csv = pd.read_csv(file_
                               , sep=','
                               , index_col=False
                               , header=0
                               , low_memory=False
                               , iterator=True
                               , chunksize=100000
                               , compression='gzip'
                               , memory_map=True
                               , encoding='utf-8')

        df = pd.concat([chunk for chunk in iter_csv])
        stage = stage.append(df, ignore_index=True)

    # 2 - Takes 55 min to write 20m records from one dataframe
    stage.to_csv('output.csv'
                 , sep='|'
                 , header=True
                 , index=False
                 , chunksize=100000
                 , encoding='utf-8')

    del stage

I've confirmed the hardware and memory are working, but these are fairly wide tables (~100 columns) of mostly numeric (decimal) data.

Thank you,

— asked Nov 17 '16 by Kimi Merroll


People also ask

Does pandas to_csv overwrite?

When you write pandas DataFrame to an existing CSV file, it overwrites the file with the new contents. To append a DataFrame to an existing CSV file, you need to specify the append write mode using mode='a' .
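A minimal sketch of both behaviors, with hypothetical file and DataFrame names:

    import pandas as pd

    df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    df2 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

    df1.to_csv('data.csv', index=False)                          # creates or overwrites data.csv
    df2.to_csv('data.csv', mode='a', header=False, index=False)  # appends rows; header=False avoids a duplicate header row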

What happens when we use pandas to_csv()?

The pandas DataFrame to_csv() function exports the DataFrame to CSV format. If a file argument is provided, the output is written to that file. Otherwise, the return value is a CSV-formatted string. sep: specify a custom delimiter for the CSV output; the default is a comma.
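A quick illustration of both cases (the DataFrame contents are hypothetical):

    import pandas as pd

    df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

    df.to_csv('out.csv')    # path given: writes the file and returns None
    s = df.to_csv()         # no path: returns the CSV content as a string
    print(s)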

Is PyArrow faster than pandas?

To summarize, if your apps save/load data from disk frequently, then it's a wise decision to leave these operations to PyArrow. Heck, it's 7 times faster for the identical file format.
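As a sketch of what delegating the CSV round-trip to PyArrow looks like (file names are hypothetical; requires the pyarrow package):

    import pyarrow as pa
    import pyarrow.csv as pv

    table = pv.read_csv('input.csv')       # multi-threaded read into an Arrow Table
    df = table.to_pandas()                 # convert to a pandas DataFrame

    # write back through Arrow (pv.write_csv is available in newer pyarrow releases)
    pv.write_csv(pa.Table.from_pandas(df), 'output.csv')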


2 Answers

Adding my small insight, since the 'gzip' alternative did not work for me: try using the to_hdf method. This reduced the write time significantly! (Less than a second for a 100 MB file, whereas the CSV option took between 30 and 55 seconds.)

    stage.to_hdf(r'path/file.h5', key='stage', mode='w')
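Reading it back is symmetric; a minimal sketch using the same path and key (to_hdf/read_hdf require the PyTables package, installed as "tables"):

    import pandas as pd

    # load the DataFrame saved above under the 'stage' key
    stage = pd.read_hdf(r'path/file.h5', key='stage')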
— answered by Amir F, Sep 18 '22


You are reading compressed files and writing a plaintext file, so this could be an I/O bottleneck.

Writing a compressed file can speed up the write by up to 10x:

    stage.to_csv('output.csv.gz'
                 , sep='|'
                 , header=True
                 , index=False
                 , chunksize=100000
                 , compression='gzip'
                 , encoding='utf-8')

Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz'), as in the sketch below.
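A minimal sketch for comparing them, assuming the stage DataFrame from the question (note that 'xz' support in to_csv may require a newer pandas than the 0.19.1 used in the question):

    import time

    for comp in (None, 'gzip', 'bz2', 'xz'):
        label = comp or 'plain'
        start = time.time()
        stage.to_csv('output.' + label + '.csv',
                     sep='|',
                     index=False,
                     chunksize=100000,
                     compression=comp,
                     encoding='utf-8')
        print('%s: %.1f s' % (label, time.time() - start))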

— answered by Frane, Sep 19 '22