I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas DataFrame to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.
    import os
    import glob
    import pandas as pd

    src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

    # 1 - Takes 2 min to read 20m records from 30 files
    stage = pd.DataFrame()
    for file_ in sorted(src_files):
        iter_csv = pd.read_csv(file_
                               , sep=','
                               , index_col=False
                               , header=0
                               , low_memory=False
                               , iterator=True
                               , chunksize=100000
                               , compression='gzip'
                               , memory_map=True
                               , encoding='utf-8')
        df = pd.concat([chunk for chunk in iter_csv])
        stage = stage.append(df, ignore_index=True)

    # 2 - Takes 55 min to write 20m records from one dataframe
    stage.to_csv('output.csv'
                 , sep='|'
                 , header=True
                 , index=False
                 , chunksize=100000
                 , encoding='utf-8')
    del stage
I've confirmed the hardware and memory are working, but these are fairly wide tables (~100 columns) of mostly numeric (decimal) data.
Thank you,
If your app saves and loads data from disk frequently, it's a wise decision to leave these operations to PyArrow; it's about 7 times faster even for the identical file format.
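A minimal sketch of what that could look like for the write step (an assumption on my part, not the poster's code: it presumes the pyarrow package is installed, uses the `stage` DataFrame from the question, and recent PyArrow releases require Python 3):

    import pyarrow as pa
    import pyarrow.csv as pa_csv

    # Convert the pandas DataFrame to an Arrow Table and let PyArrow's
    # C++ CSV writer handle the output (default delimiter is ',';
    # newer PyArrow versions also accept a delimiter in WriteOptions).
    table = pa.Table.from_pandas(stage)
    pa_csv.write_csv(table, 'output.csv',
                     write_options=pa_csv.WriteOptions(include_header=True))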
Adding my small insight, since the 'gzip' alternative did not work for me: try using the to_hdf method. This reduced the write time significantly! (less than a second for a 100 MB file, where the CSV option took between 30 and 55 seconds)
stage.to_hdf(r'path/file.h5', key='stage', mode='w')
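Note that to_hdf requires the PyTables package (`tables`) to be installed. Reading the file back, assuming the same key, is a one-liner:

    import pandas as pd

    # Load the table back from the HDF5 store under the 'stage' key.
    stage = pd.read_hdf('path/file.h5', 'stage')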
You are reading compressed files but writing a plaintext file, so the I/O could be the bottleneck. Writing a compressed file can speed up the write by up to 10x:
    stage.to_csv('output.csv.gz'
                 , sep='|'
                 , header=True
                 , index=False
                 , chunksize=100000
                 , compression='gzip'
                 , encoding='utf-8')
Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').
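A rough sketch of how you might compare them (this is my own illustration, not the poster's code: it assumes `stage` is the DataFrame from the question, and 'xz' support in to_csv may require a newer pandas than the 0.19.1 mentioned above):

    import time

    # Time the same write with each compression method and print the results.
    for comp in (None, 'gzip', 'bz2', 'xz'):
        suffix = '' if comp is None else '.' + {'gzip': 'gz', 'bz2': 'bz2', 'xz': 'xz'}[comp]
        start = time.time()
        stage.to_csv('output.csv' + suffix
                     , sep='|'
                     , header=True
                     , index=False
                     , chunksize=100000
                     , compression=comp
                     , encoding='utf-8')
        print('{}: {:.1f} s'.format(comp, time.time() - start))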