Improving the performance of pandas groupby

Tags:

I have a machine learning application written in Python which includes a data processing step. When I wrote it, I initially did the data processing on Pandas DataFrames, but when this lead to abysmal performance I eventually rewrote it using vanilla Python, with for loops instead of vectorized operations and lists and dicts instead of DataFrames and Series. To my surprise, the performance of the code written in vanilla Python ended up being far higher than that of the code written using Pandas.

As my handcoded data processing code is substantially bigger and messier than the original Pandas code, I haven't quite given up on using Pandas, and I'm currently trying to optimize the Pandas code without much success.

The core of the data processing step consists of the following: I first divide the rows into several groups, as the data consists of several thousand time series (one for each "individual"), and I then do the same data processing on each group: a lot of summarization, combining different columns into new ones, etc.

I profiled my code using Jupyter Notebook's lprun, and the bulk of the time is spent on the following and other similar lines:

grouped_data = data.groupby('pk') data[[v + 'Diff' for v in val_cols]] = grouped_data[val_cols].transform(lambda x: x - x.shift(1)).fillna(0) data[[v + 'Mean' for v in val_cols]] = grouped_data[val_cols].rolling(4).mean().shift(1).reset_index()[val_cols] (...)

...a mix of vectorized and non-vectorized processing. I understand that the non-vectorized operations won't be faster than my handwritten for loops, since that's basically what they are under the hood, but how can they be so much slower? We're talking about a performance degradation of 10-20x between my handwritten code and the Pandas code.

Am I doing something very, very wrong?

324

asked Nov 20 '17 13:11

haroba

1 Answers

No, I don't think you should give up on pandas. There's definitely better ways to do what you're trying to. The trick is to avoid apply/transform in any form as much as possible. Avoid them like the plague. They're basically implemented as for loops, so you might as well directly use python for loops which operate at C speed and give you better performance.

The real speed gain is where you get rid of the loops and use pandas' functions that implicitly vectorise their operations. For example, your first line of code can be simplified greatly, as I show you soon.

In this post I outline the setup process, and then, for each line in your question, offer an improvement, along with a side-by-side comparison of the timings and correctness.

Setup

data = {'pk' : np.random.choice(10, 1000)}  data.update({'Val{}'.format(i) : np.random.randn(1000) for i in range(100)})  df = pd.DataFrame(data)

g = df.groupby('pk') c = ['Val{}'.format(i) for i in range(100)]

`transform` + `sub` + `shift` → `diff`

Your first line of code can be replaced with a simple diff statement:

v1 = df.groupby('pk')[c].diff().fillna(0)

Sanity Check

v2 = df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)  np.allclose(v1, v2) True

Performance

%timeit df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0) 10 loops, best of 3: 44.3 ms per loop  %timeit df.groupby('pk')[c].diff(-1).fillna(0) 100 loops, best of 3: 9.63 ms per loop

Removing redundant indexing operations

As far as your second line of code is concerned, I don't see too much room for improvement, although you can get rid of the reset_index() + [val_cols] call if your groupby statement is not considering pk as the index:

g = df.groupby('pk', as_index=False)

Your second line of code then reduces to:

v3 = g[c].rolling(4).mean().shift(1)

Sanity Check

g2 = df.groupby('pk') v4 = g2[c].rolling(4).mean().shift(1).reset_index()[c]  np.allclose(v3.fillna(0), v4.fillna(0)) True

Performance

%timeit df.groupby('pk')[c].rolling(4).mean().shift(1).reset_index()[c] 10 loops, best of 3: 46.5 ms per loop  %timeit df.groupby('pk', as_index=False)[c].rolling(4).mean().shift(1) 10 loops, best of 3: 41.7 ms per loop

Note that timings vary on different machines, so make sure you test your code thoroughly to make sure there is indeed an improvement on your data.

While the difference this time isn't as much, you can appreciate the fact that there are improvements that you can make! This could possibly make a much larger impact for larger data.

Afterword

In conclusion, most operations are slow because they can be sped up. The key is to get rid of any approach that does not use vectorization.

To this end, it is sometimes beneficial to step out of pandas space and step into numpy space. Operations on numpy arrays or using numpy tend to be much faster than pandas equivalents (for example, np.sum is faster than pd.DataFrame.sum, and np.where is faster than pd.DataFrame.where, and so on).

Sometimes, loops cannot be avoided. In which case, you can create a basic looping function which you can then vectorise using numba or cython. Examples of that are here at Enhancing Performance, straight from the horses mouth.

In still other cases, your data is just too big to reasonably fit into numpy arrays. In this case, it would be time to give up and switch to dask or spark, both of which offer high performance distributed computational frameworks for working with big data.

answered Sep 29 '22 23:09

cs95

Related questions
                            
                                How best to parse a simple grammar?
                            
                                How to get rid of double backslash in python windows file path string? [duplicate]
                            
                                Python logging.DEBUG level doesn't logging
                            
                                ImportError: DLL load failed: %1 is not a valid Win32 application
                            
                                How can I rotate a matplotlib plot through 90 degrees?
                            
                                OS X - Deciding between anaconda and homebrew Python environments
                            
                                Anaconda: Install specific packages from specific channels using environment.yml
                            
                                Downsample array in Python
                            
                                Python requests.exception.ConnectionError: connection aborted "BadStatusLine"
                            
                                PIP Constraints Files
                            
                                How to run cloned Django project?
                            
                                Get list of Cache Keys in Django
                            
                                NumPy and SciPy - Difference between .todense() and .toarray()
                            
                                How to run a single line or selected code in a Jupyter Notebook or JupyterLab cell?
                            
                                Using absolute unix paths in windows with python
                            
                                Why isn't SQLAlchemy default column value available before object is committed?
                            
                                How to convert ndarray to array?
                            
                                functools.partial wants to use a positional argument as a keyword argument
                            
                                Python Asynchronous Comprehensions - how do they work?
                            
                                Create large random boolean matrix with numpy

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Improving the performance of pandas groupby

Tags:

python

pandas

haroba

People also ask

1 Answers

Setup

`transform` + `sub` + `shift` → `diff`

Removing redundant indexing operations

Afterword

cs95

Recent Activity

Donate For Us

Improving the performance of pandas groupby

Tags:

python

pandas

haroba

People also ask

1 Answers

Setup

transform + sub + shift → diff

Removing redundant indexing operations

Afterword

cs95

Related questions

Recent Activity

Donate For Us

`transform` + `sub` + `shift` → `diff`