Groupby Apply Custom Function Pandas

Tags: python, pandas, dplyr

I'm trying to apply a custom function in pandas similar to the groupby and mutate functionality in dplyr.

Given a pandas DataFrame like this:

df = pd.DataFrame({'category1':['a','a','a', 'b', 'b','b'],
  'category2':['a', 'b', 'a', 'b', 'a', 'b'],
  'var1':np.random.randint(0,100,6),
  'var2':np.random.randint(0,100,6)}
)

df
  category1 category2  var1  var2
0         a         a    23    59
1         a         b    54    20
2         a         a    48    62
3         b         b    45    76
4         b         a    60    26
5         b         b    13    70

I want to apply a function that returns as many elements as there are rows in each group:

def myfunc(s):
  return [np.mean(s)] * len(s)

to get this result:

df
  category1 category2  var1  var2  var3
0         a         a    23    59  35.5
1         a         b    54    20  54.0
2         a         a    48    62  35.5
3         b         b    45    76  29.0
4         b         a    60    26  60.0
5         b         b    13    70  29.0

I was thinking of something along the lines of:

df['var3'] = df.groupby(['category1', 'category2'], group_keys=False).apply(lambda x: myfunc(x.var1))

but haven't been able to get the index to match.

In R with dplyr this would be

df <- df %>%
  group_by(category1, category2) %>%
  mutate(
    var3 = myfunc(var1)
  )

So I was able to solve it by using a custom function like:

def myfunc_data(data):
  data['var3'] = myfunc(data.var1)
  return data

and

df = df.groupby(['category1', 'category2']).apply(myfunc_data)

but I guess I was still wondering if there's a way to do it without defining this custom function.

asked Apr 12 '19 04:04

jtanman

1 Answer

Use GroupBy.transform, which returns a Series the same size as the original DataFrame, so the result can be assigned to a new column:

import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({'category1':['a','a','a', 'b', 'b','b'],
  'category2':['a', 'b', 'a', 'b', 'a', 'b'],
  'var1':np.random.randint(0,100,6),
  'var2':np.random.randint(0,100,6)}
)

df['var3'] = df.groupby(['category1', 'category2'])['var1'].transform(myfunc)
print(df)
  category1 category2  var1  var2  var3
0         a         a    66    86    82
1         a         b    92    97    92
2         a         a    98    96    82
3         b         b    17    47    37
4         b         a    83    73    83
5         b         b    57    32    37
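
As for why the index did not match in the question: a rough sketch, assuming fixed values copied from the question's frame instead of random ones, and using a simplified variant of the attempt that selects var1 before calling apply. With a function that returns a list, apply gives back one list per group, indexed by the group keys, while transform gives back one value per original row:

```python
import numpy as np
import pandas as pd

# Fixed values taken from the example frame in the question.
df = pd.DataFrame({
    'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
    'var1': [23, 54, 48, 45, 60, 13],
})

def myfunc(s):
    # Same function as in the question: the group mean, repeated len(s) times.
    return [np.mean(s)] * len(s)

grouped = df.groupby(['category1', 'category2'])['var1']

# apply: one list per group, indexed by the (category1, category2) keys
per_group = grouped.apply(myfunc)

# transform: one value per original row, indexed like df
per_row = grouped.transform(myfunc)

print(per_group)
print(per_row)
```

Because per_group is indexed by the group keys rather than by the row labels of df, assigning it to df['var3'] cannot align row by row; per_row shares df's index and can be assigned directly.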

Alternative with lambda function:

df['var3'] = (df.groupby(['category1', 'category2'])['var1']
                .transform(lambda s: [np.mean(s)] * len(s)))
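
If the custom function only repeats a group aggregate, as myfunc does with the mean, transform also accepts the name of a built-in aggregation, so no custom function or lambda is needed. A minimal sketch, assuming the fixed values shown in the question's frame rather than random ones:

```python
import pandas as pd

# Fixed values taken from the example frame in the question.
df = pd.DataFrame({
    'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
    'var1': [23, 54, 48, 45, 60, 13],
    'var2': [59, 20, 62, 76, 26, 70],
})

# 'mean' is computed once per group and broadcast back to every row of that group.
df['var3'] = df.groupby(['category1', 'category2'])['var1'].transform('mean')
print(df)
```

For these values var3 matches the desired result in the question. A function that does more than broadcast a single aggregate still needs the custom-function or lambda form.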

answered Sep 29 '22 10:09

jezrael
