Dask read csv versus pandas read csv

I have the following problem. I have a huge CSV file and want to load it with multiprocessing. Pandas needs 19 seconds for an example file with 500000 rows and 130 columns with different dtypes. I tried Dask because I want to multiprocess the reading, but it took much longer and I wonder why. I have 32 cores and tried this:

import dask.dataframe as dd
import dask.multiprocessing

dask.config.set(scheduler='processes')
df = dd.read_csv(filepath,
                 sep='\t',
                 blocksize=1000000,
                 )
df = df.compute(scheduler='processes')  # convert to pandas
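The pandas call that takes 19 seconds is not shown in the question; presumably it is a plain single-process read, something like this (the exact call is an assumption):

import pandas as pd

# Assumed single-process baseline using pandas' C engine.
df = pd.read_csv(filepath, sep='\t')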
asked Jan 27 '23 by Varlor

1 Answer

When reading a huge file from disk, the bottleneck is I/O. Since pandas is highly optimized with a C parsing engine, there is very little to gain. Any attempt to use multiprocessing or multithreading is likely to be less performant, because you will spend the same time loading the data from disk and only add overhead for synchronizing the different processes or threads. With the process scheduler in particular, every parsed partition must also be pickled and sent back to the parent process when you call compute, which adds a serialization cost proportional to the size of the data.
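To check this on your own data, a minimal timing sketch along these lines can be used. It reuses the question's separator and blocksize; the filepath value is a hypothetical placeholder, and the actual numbers will depend on your disk and file:

import time

import pandas as pd
import dask.dataframe as dd

filepath = "data.tsv"  # hypothetical path; replace with your own file

def timed(label, fn):
    # Run fn once and report the wall-clock time.
    t0 = time.time()
    fn()
    print(f"{label:16s} {time.time() - t0:.1f}s")

if __name__ == "__main__":  # guard required by the process scheduler on Windows
    # Single-process pandas baseline (C engine).
    timed("pandas", lambda: pd.read_csv(filepath, sep="\t"))

    # Threaded scheduler: no serialization cost, but the GIL limits
    # how much of the parsing actually runs in parallel.
    timed("dask/threads", lambda: dd.read_csv(filepath, sep="\t", blocksize=1000000)
          .compute(scheduler="threads"))

    # Process scheduler: parsing runs in parallel, but every partition
    # is pickled and copied back to the parent process.
    timed("dask/processes", lambda: dd.read_csv(filepath, sep="\t", blocksize=1000000)
          .compute(scheduler="processes"))

On an I/O-bound setup all three timings tend to converge toward the raw disk read time, and the process scheduler typically comes out last because of the extra inter-process copies.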

answered Jan 29 '23 by Serge Ballesta