I have a dataframe that consists of 5 million records. I am trying to process it using the code below, leveraging Dask dataframes in Python:
import dask.dataframe as dd

dask_df = dd.read_csv(fullPath)
............
for index, row in uniqueURLs.iterrows():
    print(index)
    results = dask_df[dask_df['URL'] == row['URL']]
    count = results.size.compute()
But I noticed that Dask is very efficient at filtering dataframes BUT NOT at .compute(). If I remove the line that computes the size of results, my program becomes very fast. Can someone explain this? How can I make it faster?
When the Dask DataFrame contains data that's split across multiple nodes in a cluster, compute() may run slowly. It can also cause out-of-memory errors if the data isn't small enough to fit in the memory of a single machine. Dask was created to solve the memory issues of using pandas on a single machine.
Let's start with the simplest operation: reading a single CSV file. To my surprise, we can already see a huge difference in this most basic operation: datatable is 70% faster than pandas, while Dask is 500% faster! The outcomes are all DataFrame objects with very similar interfaces.
In your example, Dask is slower than Python multiprocessing because you don't specify the scheduler, so Dask uses the default multithreaded backend. As mdurant has pointed out, your code does not release the GIL, so multithreading cannot execute the task graph in parallel.
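If GIL-bound work is the bottleneck, one option is to pass scheduler="processes" to .compute() so tasks run in separate processes. A minimal sketch, reusing fullPath from the question and a hypothetical some_url placeholder for the value being filtered on:

import dask.dataframe as dd

dask_df = dd.read_csv(fullPath)  # fullPath as in the question

results = dask_df[dask_df['URL'] == some_url]  # some_url is a placeholder
# Request the process-based scheduler so GIL-bound tasks can run in parallel
count = results.size.compute(scheduler="processes")

Note that the process-based scheduler adds serialization overhead, so whether it helps depends on how much work each task does.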
"But I noticed that dask is very efficient in filtering dataframes BUT NOT in .compute()."
You are misunderstanding how dask.dataframe works. The line results = dask_df[dask_df['URL'] == row['URL']] performs no computation on the dataset. It merely stores instructions for computations that can be triggered at a later point. All of the computation is applied only by the line count = results.size.compute(). This is entirely expected, as dask works lazily.
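To make this concrete, here is the question's own pair of lines annotated with what happens at each step:

# Builds a task graph describing the filter; no data is read or processed yet
results = dask_df[dask_df['URL'] == row['URL']]

# Triggers the whole graph: the CSV is read, the filter is applied, and the
# size is reduced to a single number. All of the real work happens here.
count = results.size.compute()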
Think of a generator and a function such as list which can exhaust a generator. The generator itself is lazy, but triggers operations when called by such a function. dask.dataframe is also lazy, but works smartly by forming an internal "chain" of sequential operations.
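As a minimal illustration of the analogy:

# The generator is lazy: defining it performs no squaring at all
squares = (x * x for x in range(5))

# list() exhausts the generator, triggering the actual computation,
# just as .compute() triggers a Dask task graph
print(list(squares))  # [0, 1, 4, 9, 16]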
See Laziness and Computing in the docs for more information.