
Dask compute is very slow

I have a dataframe that consists of 5 million records. I am trying to process it with the code below, leveraging Dask dataframes in Python:

import dask.dataframe as dd
dask_df = dd.read_csv(fullPath)
............
for index, row in uniqueURLs.iterrows():
    print(index)
    results = dask_df[dask_df['URL'] == row['URL']]
    count = results.size.compute()

But I noticed that Dask is very efficient at filtering dataframes, BUT NOT in .compute(). If I remove the line that computes the size of results, my program becomes very fast. Can someone explain this? How can I make it faster?

Neno M. asked Oct 07 '18

People also ask

Why is DASK compute slow?

When the Dask DataFrame contains data that's split across multiple nodes in a cluster, then compute() may run slowly. It can also cause out of memory errors if the data isn't small enough to fit in the memory of a single machine. Dask was created to solve the memory issues of using pandas on a single machine.

Is DASK faster than Pandas?

Let's start with the simplest operation — reading a single CSV file. To my surprise, we can already see a huge difference in this most basic operation. Datatable is 70% faster than pandas, while dask is 500% faster! The outcomes are all DataFrame objects with nearly identical interfaces.

Is DASK faster than multiprocessing?

In your example, dask is slower than python multiprocessing, because you don't specify the scheduler, so dask uses the multithreading backend, which is the default. As mdurant has pointed out, your code does not release the GIL, therefore multithreading cannot execute the task graph in parallel.


1 Answer

But I noticed that dask is very efficient in filtering dataframes BUT NOT in .compute().

You are misunderstanding how dask.dataframe works. The line results = dask_df[dask_df['URL'] == row['URL']] performs no computation on the dataset. It merely records instructions for a computation that can be triggered at a later point.

All computations are applied only with the line count = results.size.compute(). This is entirely expected, as dask works lazily.

Think of a generator and a function such as list which can exhaust a generator. The generator itself is lazy, but will trigger operations when called by a function. dask.dataframe is also lazy, but works smartly by forming an internal "chain" of sequential operations.
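The generator analogy can be made concrete in plain Python — nothing runs until something consumes the generator:

```python
# A generator is lazy: building it performs no work.
gen = (x * x for x in range(5))   # no squaring has happened yet

# list() exhausts the generator, triggering all the deferred work,
# much like .compute() triggers dask's recorded task graph.
result = list(gen)
assert result == [0, 1, 4, 9, 16]
```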

See Laziness and Computing in the docs for more information.

jpp answered Sep 30 '22