Move from pandas to Dask to utilize all local CPU cores

Recently I stumbled upon http://dask.pydata.org/en/latest/. Since I have some pandas code which only runs on a single core, I wonder how to make use of my other CPU cores. Would Dask work well to use all (local) CPU cores? And if so, how compatible is it with pandas?

Could I use multiple CPUs with pandas itself? So far I have read about releasing the GIL, but that all seems rather complicated.

asked Mar 07 '17 by Georg Heiler


People also ask

Does dask use multiple cores?

If you're using the traditional approach with pandas or NumPy alone, you might end up stalling the project, since these libraries can't spread the work across cores by themselves. Dask helps you read the data on multiple cores and allows the processing through parallelism.
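As a minimal sketch of that multi-core read (the file paths and column names here are made up), each file becomes one or more partitions, and compute() runs the resulting task graph across your local cores:

    import dask.dataframe as dd

    # Every matching file becomes one or more partitions that Dask's
    # workers read and parse in parallel.
    df = dd.read_csv("data/2017-*.csv")

    # Operations only build a lazy task graph; compute() executes it
    # on all local cores and returns a plain pandas object.
    result = df.groupby("user_id")["value"].mean().compute()
    print(result)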

Is dask faster than pandas?

Let's start with the simplest operation: reading a single CSV file. To my surprise, we can already see a huge difference in this most basic operation: datatable is 70% faster than pandas, while Dask is 500% faster!
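That single-file case works because Dask splits one CSV into byte blocks and parses the blocks in parallel. A small sketch, assuming a large file (the name "big.csv" is hypothetical):

    import dask.dataframe as dd

    # blocksize splits the single file into 64 MB chunks, each parsed
    # by a separate task, which is where the speedup over a
    # single-threaded pandas read comes from.
    df = dd.read_csv("big.csv", blocksize="64MB")
    print(df.npartitions)      # how many chunks will be parsed in parallel
    pandas_df = df.compute()   # materialize as an ordinary pandas DataFrame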

What is the reason to use dask instead of pandas? Why would you want to parallelize your code?

Dask runs faster than pandas on this query, even when the most inefficient column type is used, because it parallelizes the computation. pandas uses only one CPU core to run the query; my computer has four cores, and Dask uses all of them to run the computation.
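To make that concrete, here is a hedged sketch of such a query (the data and column names are invented). By default dask.dataframe runs on a thread pool; you can also request the process-based scheduler to sidestep the GIL for Python-heavy work:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"key": [1, 2, 3, 4] * 250_000,
                        "value": range(1_000_000)})
    ddf = dd.from_pandas(pdf, npartitions=4)   # e.g. one partition per core

    # The groupby is split per partition and the pieces run concurrently;
    # scheduler="processes" uses separate worker processes instead of threads.
    out = ddf.groupby("key")["value"].sum().compute(scheduler="processes")
    print(out)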

Does pandas apply use all cores?

pandas' apply() method uses a single core, which means a single thread performs the work. If your machine has multiple cores, you can execute an apply in parallel, but only with a library such as Dask, not with pandas alone.
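One common pattern for parallelizing an apply-style computation is Dask's map_partitions, which hands each pandas chunk to a separate worker. A minimal sketch (the function and column names are illustrative only):

    import pandas as pd
    import dask.dataframe as dd

    def expensive(x):
        return x ** 2 + 1

    pdf = pd.DataFrame({"a": range(1_000_000)})
    ddf = dd.from_pandas(pdf, npartitions=8)

    # Each partition is a plain pandas Series, so .apply works unchanged
    # inside the function; meta tells Dask the output name and dtype.
    result = ddf["a"].map_partitions(lambda s: s.apply(expensive),
                                     meta=("a", "int64")).compute()
    print(result.head())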


1 Answer

Would dask work well to use all (local) CPU cores?

Yes.

how compatible is it with pandas?

Pretty compatible. Not 100%. You can mix in Pandas and NumPy and even pure Python stuff with Dask if needed.
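A small sketch of what that mixing looks like in practice (the column names are assumptions, not from the question): the dask.dataframe API mirrors a large subset of pandas, and you can drop back to plain pandas objects whenever Dask lacks something you need.

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": range(10), "y": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=2)   # pandas in ...

    filtered = ddf[ddf.x > 3]    # the same expression you'd write in pandas
    back = filtered.compute()    # ... and pandas back out
    assert isinstance(back, pd.DataFrame)

    # Anything Dask's API lacks can still run per partition as pure pandas:
    per_part = ddf.map_partitions(lambda p: p.describe()).compute()
    print(per_part)              # describe() of each partition, concatenated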

Could I use multiple CPUs with pandas?

You could. The easiest way would be to use multiprocessing and keep your data separate: have each job independently read from disk and write to disk, if you can do so efficiently. A significantly harder way is mpi4py, which is most useful if you have a multi-computer environment with a professional administrator.
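A minimal sketch of that multiprocessing route (the file names and the transformation are hypothetical): each job reads its own input and writes its own output, so no data is shared between processes and nothing fights over the GIL.

    from multiprocessing import Pool

    import pandas as pd

    def process_one(path):
        df = pd.read_csv(path)            # each worker reads from disk itself
        df["total"] = df.sum(axis=1)      # some per-file computation
        out_path = path.replace(".csv", ".out.csv")
        df.to_csv(out_path, index=False)  # ... and writes its result itself
        return out_path

    if __name__ == "__main__":
        files = ["part1.csv", "part2.csv", "part3.csv", "part4.csv"]
        with Pool() as pool:              # one worker per CPU core by default
            print(pool.map(process_one, files))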

answered Sep 18 '22 by John Zwinck