I have a computation which runs much slower within a Dask/Distributed worker than when run locally. I can reproduce it without any I/O involved, so I can rule out data transfer as the cause. The following code is a minimal reproducible example:
import time
import pandas as pd
import numpy as np
from dask.distributed import Client, LocalCluster

def gen_data(N=5000000):
    """ Dummy data generator """
    df = pd.DataFrame(index=range(N))
    for c in range(10):
        df[str(c)] = np.random.uniform(size=N)
    df["id"] = np.random.choice(range(100), size=len(df))
    return df

def do_something_on_df(df):
    """ Dummy computation that contains inplace mutations """
    for c in range(df.shape[1]):
        df[str(c)] = np.random.uniform(size=df.shape[0])
    return 42

def run_test():
    """ Test computation """
    df = gen_data()
    for key, group_df in df.groupby("id"):
        do_something_on_df(group_df)

class TimedContext(object):
    def __enter__(self):
        self.t1 = time.time()

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.t2 = time.time()
        print(self.t2 - self.t1)

if __name__ == "__main__":
    client = Client("tcp://10.0.2.15:8786")

    with TimedContext():
        run_test()

    with TimedContext():
        client.submit(run_test).result()
Running the test computation locally takes ~10 seconds, but it takes ~30 seconds from within Dask/Distributed. I also noticed that the Dask/Distributed workers output a lot of log messages like
distributed.core - WARNING - Event loop was unresponsive for 1.03s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 1.25s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 1.91s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 1.99s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 1.50s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 1.90s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - WARNING - Event loop was unresponsive for 2.23s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
...
which are surprising, because it's not clear what is holding the GIL in this example.
Why is there such a big performance difference? And what can I do to get the same performance?
Disclaimer: Self-answering for documentation purposes...
This is caused by a surprising behavior of Pandas. By default, the Pandas __setitem__ handlers perform checks to detect chained assignments, leading to the famous SettingWithCopyWarning. When dealing with a copy, these checks issue a call to gc.collect. Thus, code that calls __setitem__ excessively results in excessive gc.collect calls. This can have a significant impact on performance in general, but the problem is much worse within a Dask/Distributed worker, because the garbage collector has to deal with many more Python data structures than when running stand-alone. Most likely these hidden garbage collection calls are also the source of the GIL-holding warnings.
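To observe these hidden collections, one option (a minimal sketch, not part of the original example; whether the __setitem__ checks actually call gc.collect depends on your Pandas version) is to register a gc.callbacks hook and count collections while mutating groupby groups in place, mirroring the structure of the question:

import gc
import numpy as np
import pandas as pd

n_collections = 0

def count_collections(phase, info):
    # Invoked by the interpreter around every collection,
    # including explicit gc.collect() calls.
    global n_collections
    if phase == "start":
        n_collections += 1

gc.callbacks.append(count_collections)

# Same structure as the question: mutate groupby groups in place.
df = pd.DataFrame({str(c): np.random.uniform(size=500000) for c in range(10)})
df["id"] = np.random.choice(range(100), size=len(df))
for key, group_df in df.groupby("id"):
    for c in range(10):
        group_df[str(c)] = np.random.uniform(size=len(group_df))

gc.callbacks.remove(count_collections)
print("garbage collections observed:", n_collections)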
Therefore, the solution is to avoid these excessive gc.collect calls. There are two ways (both are sketched in the code below):

1. Avoid calling __setitem__ on copies: Arguably the best solution, but it requires some understanding of where copies are produced. In the above example this can be achieved by changing the function call to do_something_on_df(group_df.copy()).
2. Setting pd.options.mode.chained_assignment = None at the beginning of the computation also disables the gc.collect calls.

In both cases the test computation runs much faster than before, taking ~3.5 seconds both locally and under Dask/Distributed. This also gets rid of the GIL-holding warnings.
Note: I have filed an issue for this on GitHub.