I have a dataframe that consists of 5 million records. I am trying to process it using the code below, leveraging Dask dataframes in Python:
import dask.dataframe as dd

dask_df = dd.read_csv(fullPath)
............
for index, row in uniqueURLs.iterrows():
    print(index)
    results = dask_df[dask_df['URL'] == row['URL']]
    count = results.size.compute()
But I noticed that Dask is very efficient at filtering dataframes BUT NOT at .compute(). If I remove the line that computes the size of results, my program becomes very fast. Can someone explain this? How can I make it faster?
When the Dask DataFrame contains data that's split across multiple nodes in a cluster, compute() may run slowly. It can also cause out-of-memory errors if the data isn't small enough to fit in the memory of a single machine. Dask was created to solve the memory issues of using pandas on a single machine.
Let's start with the simplest operation: reading a single CSV file. To my surprise, we can already see a huge difference in this most basic operation: datatable is 70% faster than pandas, while Dask is 500% faster! The outcomes are all DataFrame objects with very similar interfaces.
In your example, Dask is slower than Python multiprocessing because you don't specify the scheduler, so Dask uses the default multithreaded backend. As mdurant has pointed out, your code does not release the GIL, so multithreading cannot execute the task graph in parallel.
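If GIL-bound work is the bottleneck, one option is to pass scheduler="processes" to .compute() so tasks run in separate processes. A minimal sketch, reusing fullPath from the question and a hypothetical some_url placeholder for the value being filtered on:

import dask.dataframe as dd

dask_df = dd.read_csv(fullPath)  # fullPath as in the question

results = dask_df[dask_df['URL'] == some_url]  # some_url is a placeholder
# Request the process-based scheduler so GIL-bound tasks can run in parallel
count = results.size.compute(scheduler="processes")

Note that the process-based scheduler adds serialization overhead, so whether it helps depends on how much work each task does.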
"But I noticed that dask is very efficient in filtering dataframes BUT NOT in .compute()."
You are misunderstanding how dask.dataframe works. The line results = dask_df[dask_df['URL'] == row['URL']] performs no computation on the dataset. It merely stores instructions for computations that can be triggered at a later point. All of the computation is applied only by the line count = results.size.compute(). This is entirely expected, as dask works lazily.
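To make this concrete, here is the question's own pair of lines annotated with what happens at each step:

# Builds a task graph describing the filter; no data is read or processed yet
results = dask_df[dask_df['URL'] == row['URL']]

# Triggers the whole graph: the CSV is read, the filter is applied, and the
# size is reduced to a single number. All of the real work happens here.
count = results.size.compute()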
Think of a generator and a function such as list which can exhaust a generator. The generator itself is lazy, but triggers operations when called by such a function. dask.dataframe is also lazy, but works smartly by forming an internal "chain" of sequential operations.
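As a minimal illustration of the analogy:

# The generator is lazy: defining it performs no squaring at all
squares = (x * x for x in range(5))

# list() exhausts the generator, triggering the actual computation,
# just as .compute() triggers a Dask task graph
print(list(squares))  # [0, 1, 4, 9, 16]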
See Laziness and Computing in the docs for more information.