
Pass Multiple Columns to groupby.transform

Tags: python, pandas

I understand that when you call a groupby.transform with a DataFrame column, the column is passed to the function that transforms the data. But what I cannot understand is how to pass multiple columns to the function.

import numpy as np
from pandas import DataFrame
people = DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'], index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

Now I can easily demean that data and so on, but what I can't seem to do properly is transform data inside groups using values from multiple columns as parameters of the function. For example, if I wanted to add a column 'f' that took the value (a.mean() - b.mean()) * c for each observation, where the means are taken within each group, how can that be achieved using the transform method?

I have tried variants of the following

people['f'] = np.nan
Grouped = people.groupby(key)
def TransFunc(col1, col2, col3):
    return (col1.mean() - col2.mean()) * col3
Grouped.f.transform(TransFunc(Grouped['a'], Grouped['b'], Grouped['c']))

But this is clearly wrong. I have also tried to wrap the function in a lambda but can't quite make that work either.

I am able to achieve the result by iterating through the groups in the following manner:

for group in Grouped:
    Amean = np.mean(list(group[1].a))
    Bmean = np.mean(list(group[1].b))
    CList = list(group[1].c)
    IList = list(group[1].index)

    for y in range(len(CList)):
        # .loc avoids chained assignment, which may not write back to people
        people.loc[IList[y], 'f'] = (Amean - Bmean) * CList[y]

But that does not seem a satisfactory solution, particularly if the index is non-unique. Also, I know this must be possible using groupby.transform.

To generalise the question: how does one write functions for transforming data that have parameters that involve using values from multiple columns?

Help appreciated.

Woody Pride asked Dec 03 '25 14:12


1 Answer

You can use the apply() method, which passes each group to your function as a whole DataFrame, so every column is available:

import numpy as np
import pandas as pd
np.random.seed(0)

people2 = pd.DataFrame(np.random.randn(5, 5), 
                      columns=['a', 'b', 'c', 'd', 'e'], 
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

# group_keys=False keeps the original index on the result
Grouped = people2.groupby(key, group_keys=False)

def f(df):
    # return a copy of the group with the new column added
    return df.assign(f=(df.a.mean() - df.b.mean())*df.c)

people2 = Grouped.apply(f)
print(people2)

If you want a more general method, you can unpack the group's columns into keyword arguments:

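# group_keys=False so the result keeps the original index and aligns on assignment
Grouped = people2.groupby(key, group_keys=False)

def f(a, b, c, **kw):
    # a, b, c are the group's columns; any other columns land in kw
    return (a.mean() - b.mean())*c

people2["f"] = Grouped.apply(lambda df: f(**df))
print(people2)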
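
As a side note that goes beyond the two snippets above: since the question asks specifically about transform, one possible sketch is to call transform once per column and do the multi-column arithmetic outside it. transform works on one column at a time, but transform('mean') returns a Series aligned to the original rows, so the per-group means can be combined with column c directly. The names a_mean and b_mean are only illustrative, and this assumes a reasonably recent pandas:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

grouped = people.groupby(key)

# transform('mean') broadcasts each group's mean back to every row of that group
a_mean = grouped['a'].transform('mean')
b_mean = grouped['b'].transform('mean')

# the multi-column arithmetic happens outside transform, on row-aligned Series
people['f'] = (a_mean - b_mean) * people['c']
print(people)
```

Because the combination is done positionally on Series that share the frame's index, this should also behave sensibly when the index is non-unique, which was one of the concerns with the explicit loop.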
HYRY answered Dec 06 '25 10:12



