Recently I stumbled upon http://dask.pydata.org/en/latest/. I have some pandas code that only runs on a single core, and I wonder how to make use of my other CPU cores. Would dask work well to use all (local) CPU cores? If yes, how compatible is it with pandas?
Could I use multiple CPUs with pandas? So far I have read about releasing the GIL, but that all seems rather complicated.
If you take the traditional approach with pandas or NumPy alone, a large workload can stall the project, because these libraries do not spread work across multiple processes on their own. Dask reads the data on multiple cores and processes it in parallel.
Let's start with the simplest operation: reading a single CSV file. To my surprise, we can already see a huge difference in this most basic operation. Datatable is 70% faster than pandas, while Dask is 500% faster!
Dask runs faster than pandas for this query, even when the most inefficient column type is used, because it parallelizes the computations. pandas only uses 1 CPU core to run the query. My computer has 4 cores and Dask uses all the cores to run the computation.
Pandas' apply(~) method runs in a single thread, so it uses only one core. If your machine has multiple cores, you can split the DataFrame into chunks and run apply(~) on the chunks in parallel.
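As a rough sketch of that idea, the snippet below splits a DataFrame into row chunks and runs an ordinary apply on each chunk in a separate process using the standard library's concurrent.futures. The names row_total, split_rows, apply_chunk and parallel_apply, as well as the columns a and b, are made up for illustration. It assumes a non-empty DataFrame and a per-row function that is expensive enough to outweigh the cost of copying chunks to the workers.

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def row_total(row):
    # Hypothetical per-row function; stands in for expensive row-wise logic.
    return row["a"] * 2 + row["b"]


def split_rows(df, n_chunks):
    # Cut the frame into at most n_chunks contiguous blocks of rows.
    size = max(1, -(-len(df) // n_chunks))  # ceiling division
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]


def apply_chunk(chunk):
    # Runs inside a worker process: a normal single-core apply on one chunk.
    return chunk.apply(row_total, axis=1)


def parallel_apply(df, n_workers=4):
    chunks = split_rows(df, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in the same order as the input chunks.
        parts = list(pool.map(apply_chunk, chunks))
    return pd.concat(parts)


if __name__ == "__main__":
    df = pd.DataFrame({"a": range(8), "b": range(8, 16)})
    print(parallel_apply(df, n_workers=2))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes re-import the main module. For cheap row functions, the overhead of sending chunks to the workers can easily exceed the gain.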
Would dask work well to use all (local) CPU cores?
Yes.
How compatible is it with pandas?
Pretty compatible. Not 100%. You can mix in Pandas and NumPy and even pure Python stuff with Dask if needed.
Could I use multiple CPUs with pandas?
You could. The easiest way would be to use multiprocessing and keep your data separate: have each job independently read from disk and write to disk if you can do so efficiently. A significantly harder way is using mpi4py, which is most useful if you have a multi-computer environment with a professional administrator.
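A minimal sketch of the multiprocessing route, assuming the data is already split into several CSV files that can be processed independently. The directory names input and output, the columns key and value, and the functions process_file and run_all are placeholders chosen for the example, not anything prescribed by pandas.

```python
import multiprocessing as mp
from pathlib import Path

import pandas as pd


def process_file(paths):
    # Each job reads its own input and writes its own output,
    # so no DataFrame has to be passed between processes.
    src, dst = paths
    df = pd.read_csv(src)
    summary = df.groupby("key", as_index=False)["value"].sum()
    summary.to_csv(dst, index=False)
    return dst


def run_all(jobs, n_workers=None):
    # n_workers=None lets Pool use one worker per available CPU.
    with mp.Pool(processes=n_workers) as pool:
        return pool.map(process_file, jobs)


if __name__ == "__main__":
    # Hypothetical layout: input/*.csv -> output/<same file name>
    in_dir, out_dir = Path("input"), Path("output")
    out_dir.mkdir(exist_ok=True)
    jobs = [(str(p), str(out_dir / p.name)) for p in sorted(in_dir.glob("*.csv"))]
    print(run_all(jobs))
```

Only the file paths travel between processes here, which keeps the inter-process overhead small. That is the point of keeping the data separate.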