Dask read csv versus pandas read csv

I have the following problem. I have a huge CSV file and want to load it with multiprocessing. Pandas needs 19 seconds for an example file with 500000 rows and 130 columns with different dtypes. I tried Dask because I want to multiprocess the reading, but it took much longer and I wonder why. I have 32 cores and tried this:

import dask.dataframe as dd
import dask.multiprocessing

dask.config.set(scheduler='processes')
df = dd.read_csv(filepath,
                 sep='\t',
                 blocksize=1000000,
                 )
df = df.compute(scheduler='processes')  # convert to pandas
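The pandas call that takes 19 seconds is not shown in the question; presumably it is a plain single-process read, something like this (the exact call is an assumption):

import pandas as pd

# Assumed single-process baseline using pandas' C engine.
df = pd.read_csv(filepath, sep='\t')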
asked Jan 27 '23 by Varlor

1 Answer

When reading a huge file from disk, the bottleneck is I/O. Since pandas is highly optimized with a C parsing engine, there is very little to gain. Any attempt to use multiprocessing or multithreading is likely to be less performant, because you will spend the same time loading the data from disk and only add overhead for synchronizing the different processes or threads. With the process scheduler in particular, every parsed partition must also be pickled and sent back to the parent process when you call compute, which adds a serialization cost proportional to the size of the data.
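To check this on your own data, a minimal timing sketch along these lines can be used. It reuses the question's separator and blocksize; the filepath value is a hypothetical placeholder, and the actual numbers will depend on your disk and file:

import time

import pandas as pd
import dask.dataframe as dd

filepath = "data.tsv"  # hypothetical path; replace with your own file

def timed(label, fn):
    # Run fn once and report the wall-clock time.
    t0 = time.time()
    fn()
    print(f"{label:16s} {time.time() - t0:.1f}s")

if __name__ == "__main__":  # guard required by the process scheduler on Windows
    # Single-process pandas baseline (C engine).
    timed("pandas", lambda: pd.read_csv(filepath, sep="\t"))

    # Threaded scheduler: no serialization cost, but the GIL limits
    # how much of the parsing actually runs in parallel.
    timed("dask/threads", lambda: dd.read_csv(filepath, sep="\t", blocksize=1000000)
          .compute(scheduler="threads"))

    # Process scheduler: parsing runs in parallel, but every partition
    # is pickled and copied back to the parent process.
    timed("dask/processes", lambda: dd.read_csv(filepath, sep="\t", blocksize=1000000)
          .compute(scheduler="processes"))

On an I/O-bound setup all three timings tend to converge toward the raw disk read time, and the process scheduler typically comes out last because of the extra inter-process copies.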

answered Jan 29 '23 by Serge Ballesta