 

Is there a faster way to export data from Dask DataFrame to CSV?

Tags: python, csv, dask

I am reading a CSV file (10 GB) using Dask. After performing some operations, I export the result in CSV format using to_csv. The problem is that exporting this file takes around 27 minutes (according to the ProgressBar diagnostic).

The CSV file has 350 columns: one timestamp column, while the rest have their datatype set to float64.

  • Machine Specs:
    • Intel i7-4610M @ 3.00 GHz
    • 8 GB DDR3 RAM
    • 500 GB SSD
    • Windows 10 Pro

I have tried exporting to separate files with to_csv('filename-*.csv'), and also without the .csv extension, in which case Dask writes the files with a .part extension. Both approaches take the same time as mentioned above.

I don't think this is an I/O bottleneck, since I am using an SSD, but I am not sure about that.

Here is my code (simplified):

import dask.dataframe as dd

df = dd.read_csv('path\\to\\csv')
# Doing some operations using df.loc
df.to_csv('export.csv', single_file=True)
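
For reference, the 27-minute figure comes from wrapping the export in Dask's ProgressBar diagnostic, roughly like this:

from dask.diagnostics import ProgressBar

# Prints a progress bar while the task graph (read, operations, write) runs.
with ProgressBar():
    df.to_csv('export.csv', single_file=True)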

I am using Dask v2.6.0.

Expected outcome: complete this process in less time without changing the machine's specs.

Is there any way I can export this file in less time?

asked Oct 27 '22 by Pritesh K.

1 Answer

By default, Dask DataFrame uses the multi-threaded scheduler. This is optimal for most pandas operations, but read_csv partially holds onto the GIL, so you might want to try the multiprocessing or dask.distributed schedulers.

See more information about that here: https://docs.dask.org/en/latest/scheduling.html
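
A minimal sketch of switching to the multiprocessing scheduler for this pipeline (the path is a placeholder; on Windows the main-module guard is required for multiprocessing):

import dask
import dask.dataframe as dd

if __name__ == '__main__':  # required on Windows for the multiprocessing scheduler
    df = dd.read_csv('path\\to\\csv')
    # The multiprocessing scheduler sidesteps the GIL that read_csv
    # partially holds under the default threaded scheduler.
    with dask.config.set(scheduler='processes'):
        df.to_csv('export-*.csv')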

If you can, I also recommend using a more efficient file format, like Parquet:

https://docs.dask.org/en/latest/dataframe-best-practices.html#store-data-in-apache-parquet-format
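
For example, a minimal sketch of writing the same DataFrame to Parquet instead of CSV (assumes pyarrow or fastparquet is installed; the output path is a placeholder):

# Parquet is a binary columnar format, so it avoids the float-to-text
# conversion cost that dominates exporting a wide float64 CSV.
df.to_parquet('export.parquet', engine='pyarrow')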

answered Nov 02 '22 by MRocklin