I have the following problem. I have a huge CSV file and want to load it with multiprocessing. Pandas needs 19 seconds for an example file with 500,000 rows and 130 columns with different dtypes. I tried Dask because I want to parallelize the reading, but it took much longer and I wonder why. I have 32 cores and tried this:
import dask.dataframe as dd
import dask.multiprocessing
dask.config.set(scheduler='processes')
df = dd.read_csv(filepath,
                 sep='\t',
                 blocksize=1000000,
                 )
df = df.compute(scheduler='processes')  # convert to pandas
When reading a huge file from disk, the bottleneck is the I/O. Since Pandas is highly optimized with a C parsing engine, there is very little to gain. Any attempt to use multi-processing or multi-threading is likely to be slower, because you still spend the same time loading the data from disk and only add overhead for synchronizing the different processes or threads; with the process scheduler, each worker's parsed chunk also has to be serialized and copied back to the main process before the final pandas DataFrame can be assembled.
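A minimal timing sketch makes the comparison concrete (the file path below is a placeholder for your own tab-separated file; the dask call mirrors the snippet from the question):

import time
import pandas as pd
import dask.dataframe as dd

filepath = "example.tsv"  # placeholder: path to the 500,000-row tab-separated file

# Single-process pandas with the C parsing engine.
start = time.perf_counter()
df_pd = pd.read_csv(filepath, sep='\t')
print(f"pandas read_csv:   {time.perf_counter() - start:.1f} s")

# Dask with the multiprocessing scheduler, as in the question.
start = time.perf_counter()
df_dask = dd.read_csv(filepath, sep='\t', blocksize=1000000).compute(scheduler='processes')
print(f"dask (processes):  {time.perf_counter() - start:.1f} s")

If the pandas number is already close to the time it takes just to read the raw bytes from disk, parallelizing the parse cannot help much, and the dask run will mostly show the extra scheduling and serialization cost.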