I am reading a CSV file (10 GB) using Dask. After performing some operations, I am exporting the file in CSV format using to_csv.
The problem is that exporting this file takes around 27 minutes (according to the ProgressBar diagnostics).
The CSV file has 350 columns: one timestamp column, with the remaining columns' dtype set to float64.
I have tried exporting to separate files, e.g. to_csv('filename-*.csv'), and also without including .csv, in which case Dask writes the files with a .part extension. Either way, it takes the same time as mentioned above.
I don't think this is an I/O bottleneck, since I am using an SSD, but I am not sure about that.
Here is my code (simplified):
import dask.dataframe as dd

df = dd.read_csv('path\\to\\csv')
# ... some operations using df.loc ...
df.to_csv('export.csv', single_file=True)
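For reference, the multi-file variants I mentioned above look like this (the name pattern is just a placeholder):

df.to_csv('filename-*.csv')   # one CSV per partition
df.to_csv('filename')         # same, but the files get a .part extension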
I am using Dask v2.6.0.
Expected outcome: complete this process in less time without changing the machine's specs.
Is there any way I can export this file in less time?
By default, Dask DataFrame uses the multi-threaded scheduler. This is optimal for most pandas operations, but read_csv
partially holds onto the GIL, so you might want to try the multiprocessing or dask.distributed schedulers.
See more information about that here: https://docs.dask.org/en/latest/scheduling.html
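As a rough sketch (reusing the placeholder path from the question, and assuming the export does not have to be a single file), switching the scheduler for the whole graph can look like this:

import dask
import dask.dataframe as dd

df = dd.read_csv('path\\to\\csv')
# ... operations using df.loc ...

# Run read_csv, the operations, and the export on the multiprocessing
# scheduler instead of the default threaded one.
with dask.config.set(scheduler='processes'):
    df.to_csv('export-*.csv')

# Alternatively, create a dask.distributed Client; once it exists,
# Dask uses it as the default scheduler.
# from dask.distributed import Client
# client = Client()
# df.to_csv('export-*.csv')

Note that if you run the multiprocessing scheduler from a script on Windows, the usual if __name__ == '__main__': guard is needed.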
If you can, I also recommend using a more efficient file format, like Parquet
https://docs.dask.org/en/latest/dataframe-best-practices.html#store-data-in-apache-parquet-format
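A minimal sketch of a Parquet round trip, assuming pyarrow (or fastparquet) is installed; the paths are placeholders and the 'timestamp' column name is only illustrative:

import dask.dataframe as dd

df = dd.read_csv('path\\to\\csv')
# ... operations using df.loc ...

# Parquet is a binary, columnar format: writing is typically much faster
# than CSV, and later reads can load only the columns you need.
df.to_parquet('export.parquet', engine='pyarrow')

df2 = dd.read_parquet('export.parquet', engine='pyarrow', columns=['timestamp'])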