Pandas groupby + transform and multiple columns

Tags: python, pandas, pandas-groupby

To get group-level results at the same level of detail as the original DataFrame (same number of rows), I have been using the transform function.

Example: Original dataframe

name, year, grade
Jack, 2010, 6
Jack, 2011, 7
Rosie, 2010, 7
Rosie, 2011, 8

After groupby transform

name, year, grade, average grade
Jack, 2010, 6, 6.5
Jack, 2011, 7, 6.5
Rosie, 2010, 7, 7.5
Rosie, 2011, 8, 7.5
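
A minimal sketch of how the table above can be produced, assuming the data lives in a DataFrame called `grades` (a name chosen here for illustration) and that the average is taken per `name`:

```python
import pandas as pd

# the example data from the question
grades = pd.DataFrame({
    "name": ["Jack", "Jack", "Rosie", "Rosie"],
    "year": [2010, 2011, 2010, 2011],
    "grade": [6, 7, 7, 8],
})

# transform returns one value per original row, so the result
# can be assigned straight back as a new column
grades["average grade"] = grades.groupby("name")["grade"].transform("mean")
print(grades)
```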

However, with more advanced functions that depend on multiple columns, things get more complicated. What puzzles me is that I cannot seem to access multiple columns inside a groupby + transform call.

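import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

def f(x):
    # intended: sum of columns 'a' and 'b' within each group
    y = sum(x['a']) + sum(x['b'])
    return y

# this line raises the KeyError shown below
df['e'] = df.groupby(['c', 'd']).transform(f)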

Gives me:

KeyError: ('a', 'occurred at index a')

Though I know that the following does work (it returns one value per group rather than one per row):

df.groupby(['c','d']).apply(f)

What causes this behavior and how can I obtain something like this:

a   b   c   d   e
1   1   q   z   12
2   2   q   z   12
3   3   q   z   12
4   4   q   o   8
5   5   w   o   22
6   6   w   o   22


asked Nov 08 '18 16:11

Willem

2 Answers

For this particular case you could do:

g = df.groupby(['c', 'd'])

df['e'] = g.a.transform('sum') + g.b.transform('sum')

df
# outputs

   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22
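
As for what causes the KeyError: as far as I can tell, transform with a custom function calls that function on one column at a time, so `x` is a Series and `x['a']` is looked up in its row index rather than among the columns. A small sketch to check this, where `probe` and `seen` are hypothetical names introduced only for the demonstration (the exact sequence of calls may differ between pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

seen = []  # hypothetical log of what the function receives

def probe(x):
    # record the type and name of each object pandas hands to the function
    seen.append((type(x).__name__, getattr(x, 'name', None)))
    return x.sum()

out = df.groupby(['c', 'd'])[['a', 'b']].transform(probe)

# entries of type Series show the column-by-column calls
print([entry for entry in seen if entry[0] == 'Series'])
print(out)
```

Because each call only sees one column, a function that needs `a` and `b` together cannot be written as a single transform.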

If the final result can be built by combining independent per-column transforms on the same groupby (here, the sum of two group sums), the per-column transform approach works.

Otherwise, use groupby + apply and then merge the result back into the original df.

Example:

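# compute one value per (c, d) group, then merge it back onto the rows
sums = df.groupby(['c', 'd']).apply(lambda x: (x.a + x.b).sum()).rename('e').reset_index()
df.merge(sums, on=['c', 'd'])
# same output as above (assuming df does not already contain a column 'e')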
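
The apply-then-merge route is most useful when the function needs several columns at once. A hedged sketch with a weighted mean, which is not part of the original question: `weighted_mean`, `per_group` and `result` are illustrative names, and selecting `[['a', 'b']]` before apply is a choice made here so the function only receives the value columns.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

def weighted_mean(group):
    # needs both columns at once: mean of 'a' weighted by 'b'
    return (group['a'] * group['b']).sum() / group['b'].sum()

# one value per (c, d) group, as a small DataFrame with columns c, d, e
per_group = (df.groupby(['c', 'd'])[['a', 'b']]
               .apply(weighted_mean)
               .rename('e')
               .reset_index())

# a left merge keeps every original row and its order
result = df.merge(per_group, on=['c', 'd'], how='left')
print(result)
```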

answered Oct 23 '22 11:10

Haleemur Ali

You can use GroupBy + transform with sum twice:

df['e'] = df.groupby(['c', 'd'])[['a', 'b']].transform('sum').sum(axis=1)

print(df)

   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22
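
To make the two sums easier to follow, here is the same one-liner split into steps; `per_column` is a hypothetical intermediate name used only for this sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# first sum: group totals of 'a' and 'b', broadcast back to every row
per_column = df.groupby(['c', 'd'])[['a', 'b']].transform('sum')
print(per_column)

# second sum: add the two broadcast totals across columns, row by row
df['e'] = per_column.sum(axis=1)
print(df)
```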

answered Oct 23 '22 10:10

jpp
