Normalize DataFrame by group

Tags:

pandas

Let's say that I have some data generated as follows:

import numpy as np

N = 20
m = 3
data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3

and then I create some categorization variable:

indx = np.random.randint(0,3,size=N).astype(np.int32)

and generate a DataFrame:

import pandas as pd
df = pd.DataFrame(np.hstack((data, indx[:,None])), 
             columns=['a%s' % k for k in range(m)] + [ 'indx'])

I can get the mean value, per group as:

df.groupby('indx').mean()

What I'm unsure how to do is subtract each group's mean from the original data, column by column, so that every column is normalized by its within-group mean. Any suggestions would be appreciated.


asked Sep 25 '14 19:09

JoshAdel

4 Answers

In [10]: df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())

should do it.
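A quick sanity check of this one-liner, sketched on a small seeded dataset built like the one in the question (the seed, the sizes, and the names `normalized` and `check` are assumptions for illustration): after the transform, every group should have mean 0 and standard deviation 1 in each column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
N, m = 200, 3
data = rng.normal(size=(N, m)) + rng.normal(size=(N, m)) ** 3

df = pd.DataFrame(data, columns=[f"a{k}" for k in range(m)])
df["indx"] = rng.integers(0, 3, size=N)

# The grouping column is dropped from the result; only a0..a2 come back.
normalized = df.groupby("indx").transform(lambda x: (x - x.mean()) / x.std())

# Re-group the normalized values by the original labels and inspect the stats.
check = normalized.groupby(df["indx"]).agg(["mean", "std"])
print(check.round(6))
```

If only centering is wanted, as the question asks, the same call works without the division: `lambda x: x - x.mean()`.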


answered Oct 19 '22 08:10

TomAugspurger

If the data contains many groups (thousands or more), the accepted answer using a lambda may take a very long time to compute. A fast solution would be:

groups = df.groupby("indx")
mean, std = groups.transform("mean"), groups.transform("std")
normalized = (df[mean.columns] - mean) / std

Explanation and benchmarking

The accepted answer suffers from a performance problem because it passes a lambda to transform. Even though groupby.transform itself is fast, as are the already vectorized calls inside the lambda (.mean(), .std() and the subtraction), calling the pure Python lambda once per group creates considerable overhead.

This can be avoided by using only vectorized Pandas/Numpy calls and not writing any Python function, as shown in ErnestScribbler's answer.

We can get around the headache of merging and naming the columns by leveraging the broadcasting abilities of .transform. Let's put the solution from above into a function for benchmarking:

def normalize_by_group(df, by):
    groups = df.groupby(by)
    # computes group-wise mean/std,
    # then auto broadcasts to size of group chunk
    mean = groups.transform("mean")
    std = groups.transform("std")
    normalized = (df[mean.columns] - mean) / std
    return normalized
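Before timing anything, it is worth confirming that the two approaches agree. The following is a minimal sketch on assumed test data (the seed, the sizes, and the names `fast` and `slow` are illustrative, not from the answer); `pd.testing.assert_frame_equal` raises if the frames differ beyond floating-point tolerance.

```python
import numpy as np
import pandas as pd


def normalize_by_group(df, by):
    groups = df.groupby(by)
    mean = groups.transform("mean")
    std = groups.transform("std")
    return (df[mean.columns] - mean) / std


rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["a0", "a1", "a2"])
df["indx"] = rng.integers(0, 50, size=len(df))

fast = normalize_by_group(df, "indx")
slow = df.groupby("indx").transform(lambda x: (x - x.mean()) / x.std())

# Raises AssertionError if the vectorized and lambda results disagree.
pd.testing.assert_frame_equal(fast, slow)
```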

I changed the data generation from the original question to allow for more groups:

def gen_data(N, num_groups):
    m = 3
    data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3
    indx = np.random.randint(0,num_groups,size=N).astype(np.int32)

    df = pd.DataFrame(np.hstack((data, indx[:,None])), 
                      columns=['a%s' % k for k in range(m)] + [ 'indx'])
    return df

With only two groups (thus only two Python function calls), the lambda version is only about 1.9x slower than the vectorized code:

In: df2g = gen_data(10000, 2)  # 3 cols, 10000 rows, 2 groups

In: %timeit normalize_by_group(df2g, "indx")
6.61 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In: %timeit df2g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
12.3 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Increase the number of groups to 1000 and the runtime issue becomes apparent. The lambda version is about 370x slower than the vectorized code:

In: df1000g = gen_data(10000, 1000)  # 3 cols, 10000 rows, 1000 groups

In: %timeit normalize_by_group(df1000g, "indx")
7.5 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In: %timeit df1000g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
2.78 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
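The timings above come from IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard `timeit` module. The helper names `vectorized` and `with_lambda`, the seed, and the smaller problem size are assumptions chosen to keep the run short; absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def gen_data(N, num_groups, seed=0):
    # Same kind of data as above, but seeded (the seed is an addition).
    rng = np.random.default_rng(seed)
    m = 3
    data = rng.normal(size=(N, m)) + rng.normal(size=(N, m)) ** 3
    df = pd.DataFrame(data, columns=[f"a{k}" for k in range(m)])
    df["indx"] = rng.integers(0, num_groups, size=N)
    return df


def vectorized(df):
    groups = df.groupby("indx")
    mean, std = groups.transform("mean"), groups.transform("std")
    return (df[mean.columns] - mean) / std


def with_lambda(df):
    return df.groupby("indx").transform(lambda x: (x - x.mean()) / x.std())


df = gen_data(2000, 200)

# Best of three single runs, in seconds.
t_vec = min(timeit.repeat(lambda: vectorized(df), number=1, repeat=3))
t_lam = min(timeit.repeat(lambda: with_lambda(df), number=1, repeat=3))
print(f"vectorized: {t_vec * 1e3:.1f} ms, lambda: {t_lam * 1e3:.1f} ms")
```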

answered Oct 19 '22 08:10

w-m

The accepted answer works and is elegant. Unfortunately, for large datasets I think using .transform() with a lambda is much slower than the following, less elegant approach (illustrated with a single column 'a0'):

means_stds = df.groupby('indx')['a0'].agg(['mean','std']).reset_index()
df = df.merge(means_stds,on='indx')
df['a0_normalized'] = (df['a0'] - df['mean']) / df['std']

To do it for multiple columns you'll have to figure out the merge. My suggestion would be to flatten the multiindex columns from aggregation as in this answer and then merge and normalize for each column separately:

means_stds = df.groupby('indx')[['a0','a1']].agg(['mean','std']).reset_index()
means_stds.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in means_stds.columns]
df = df.merge(means_stds,on='indx')
for col in ['a0','a1']:
    df[col+'_normalized'] = ( df[col] - df[col+'|mean'] ) / df[col+'|std']
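One possible variant of the same idea, not taken from the answer, avoids flattening a MultiIndex: compute the means and standard deviations separately, tag them with `add_suffix`, and attach them with `DataFrame.join(..., on='indx')`. The data, the seed, and the names `stats` and `wide` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # arbitrary seed for reproducibility
df = pd.DataFrame(rng.normal(size=(500, 2)), columns=["a0", "a1"])
df["indx"] = rng.integers(0, 10, size=len(df))

cols = ["a0", "a1"]
grouped = df.groupby("indx")[cols]

# One row per group; the suffixes keep statistic columns apart from the data.
stats = grouped.mean().add_suffix("|mean").join(grouped.std().add_suffix("|std"))

# Left join of the 'indx' column against the index of stats:
# the original row order and index of df are preserved.
wide = df.join(stats, on="indx")
for col in cols:
    wide[col + "_normalized"] = (wide[col] - wide[col + "|mean"]) / wide[col + "|std"]
```

Because the join is a left join on the key column, `wide` keeps one row per original row, so the normalized columns line up with `df` without re-sorting.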

answered Oct 19 '22 10:10

ErnestScribbler

Although this is not the prettiest solution, you could do something like this:

indx = df['indx'].copy()
for indices in df.groupby('indx').groups.values():
    df.loc[indices] -= df.loc[indices].mean()
df['indx'] = indx
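The loop above has to save and restore the 'indx' column because the in-place subtraction also touches it. A loop-free sketch of the same mean subtraction, restricted to the value columns so the key is never modified, could look like this (the data, the seed, and the names `value_cols` and `centered` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # arbitrary seed for reproducibility
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a0", "a1", "a2"])
df["indx"] = rng.integers(0, 3, size=len(df))

value_cols = ["a0", "a1", "a2"]

# transform("mean") returns a frame shaped like df[value_cols],
# holding each row's group mean, so a plain subtraction centers the data.
centered = df.copy()
centered[value_cols] = df[value_cols] - df.groupby("indx")[value_cols].transform("mean")
```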

answered Oct 19 '22 09:10

Mike
