 

Pandas to_csv() slow saving large dataframe


I'm guessing this is an easy fix, but I'm running into an issue where it takes nearly an hour to save a pandas DataFrame to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.

    import os
    import glob
    import pandas as pd

    src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

    # 1 - Takes 2 min to read 20m records from 30 files
    for file_ in sorted(src_files):
        stage = pd.DataFrame()
        iter_csv = pd.read_csv(file_
                               , sep=','
                               , index_col=False
                               , header=0
                               , low_memory=False
                               , iterator=True
                               , chunksize=100000
                               , compression='gzip'
                               , memory_map=True
                               , encoding='utf-8')

        df = pd.concat([chunk for chunk in iter_csv])
        stage = stage.append(df, ignore_index=True)

    # 2 - Takes 55 min to write 20m records from one dataframe
    stage.to_csv('output.csv'
                 , sep='|'
                 , header=True
                 , index=False
                 , chunksize=100000
                 , encoding='utf-8')

    del stage

I've confirmed the hardware and memory are working, but these are fairly wide tables (~100 columns) of mostly numeric (decimal) data.

Thank you,

— asked Nov 17 '16 by Kimi Merroll


People also ask

Does pandas to_csv overwrite?

When you write pandas DataFrame to an existing CSV file, it overwrites the file with the new contents. To append a DataFrame to an existing CSV file, you need to specify the append write mode using mode='a' .
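A minimal sketch of both behaviors, with hypothetical file and DataFrame names:

    import pandas as pd

    df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    df2 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

    df1.to_csv('data.csv', index=False)                          # creates or overwrites data.csv
    df2.to_csv('data.csv', mode='a', header=False, index=False)  # appends rows; header=False avoids a duplicate header row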

What happens when we use pandas to_csv()?

The pandas DataFrame to_csv() function exports the DataFrame to CSV format. If a file argument is provided, the output is written to that file. Otherwise, the return value is a CSV-formatted string. sep: specify a custom delimiter for the CSV output; the default is a comma.
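A quick illustration of both cases (the DataFrame contents are hypothetical):

    import pandas as pd

    df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

    df.to_csv('out.csv')    # path given: writes the file and returns None
    s = df.to_csv()         # no path: returns the CSV content as a string
    print(s)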

Is PyArrow faster than pandas?

To summarize, if your apps save/load data from disk frequently, then it's a wise decision to leave these operations to PyArrow. Heck, it's 7 times faster for the identical file format.
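As a sketch of what delegating the CSV round-trip to PyArrow looks like (file names are hypothetical; requires the pyarrow package):

    import pyarrow as pa
    import pyarrow.csv as pv

    table = pv.read_csv('input.csv')       # multi-threaded read into an Arrow Table
    df = table.to_pandas()                 # convert to a pandas DataFrame

    # write back through Arrow (pv.write_csv is available in newer pyarrow releases)
    pv.write_csv(pa.Table.from_pandas(df), 'output.csv')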


2 Answers

Adding my small insight, since the 'gzip' alternative did not work for me: try using the to_hdf method. This reduced the write time significantly! (Less than a second for a 100 MB file, whereas the CSV option took between 30 and 55 seconds.)

    stage.to_hdf(r'path/file.h5', key='stage', mode='w')
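Reading it back is symmetric; a minimal sketch using the same path and key (to_hdf/read_hdf require the PyTables package, installed as "tables"):

    import pandas as pd

    # load the DataFrame saved above under the 'stage' key
    stage = pd.read_hdf(r'path/file.h5', key='stage')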
— answered by Amir F, Sep 18 '22


You are reading compressed files and writing a plaintext file, so this could be an I/O bottleneck.

Writing a compressed file can speed up the write by up to 10x:

    stage.to_csv('output.csv.gz'
                 , sep='|'
                 , header=True
                 , index=False
                 , chunksize=100000
                 , compression='gzip'
                 , encoding='utf-8')

Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz'), as in the sketch below.
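A minimal sketch for comparing them, assuming the stage DataFrame from the question (note that 'xz' support in to_csv may require a newer pandas than the 0.19.1 used in the question):

    import time

    for comp in (None, 'gzip', 'bz2', 'xz'):
        label = comp or 'plain'
        start = time.time()
        stage.to_csv('output.' + label + '.csv',
                     sep='|',
                     index=False,
                     chunksize=100000,
                     compression=comp,
                     encoding='utf-8')
        print('%s: %.1f s' % (label, time.time() - start))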

— answered by Frane, Sep 19 '22