
Groupby Pandas DataFrame and calculate mean and stdev of one column and add the std as a new column with reset_index

Tags:

python

pandas

I have a Pandas DataFrame as below:

   a      b      c      d
0  Apple  3      5      7
1  Banana 4      4      8
2  Cherry 7      1      3
3  Apple  3      4      7

I would like to group the rows by column 'a', replace the values in column 'c' with the mean over each group, and add another column holding the standard deviation of the column 'c' values that went into that mean. The values in columns 'b' and 'd' are constant within each group. The desired output would be:

   a      b      c      d      e
0  Apple  3      4.5    7      0.707107
1  Banana 4      4      8      0
2  Cherry 7      1      3      0

What is the best way to achieve this?

kkhatri99 asked Oct 28 '14 01:10




1 Answer

You could use a groupby-agg operation:

In [38]: result = df.groupby(['a'], as_index=False).agg(
                      {'c':['mean','std'],'b':'first', 'd':'first'})

and then rename and reorder the columns:

In [39]: result.columns = ['a','c','e','b','d']

In [40]: result.reindex(columns=sorted(result.columns))
Out[40]:
        a  b    c  d         e
0   Apple  3  4.5  7  0.707107
1  Banana  4  4.0  8       NaN
2  Cherry  7  1.0  3       NaN
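For reference, the steps above can be reproduced end to end. The following is a minimal sketch, not the original answer's code: it rebuilds the DataFrame from the question and assumes pandas 0.25 or later, where named aggregation lets you choose the output column names and order directly instead of renaming and reindexing afterwards.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    'a': ['Apple', 'Banana', 'Cherry', 'Apple'],
    'b': [3, 4, 7, 3],
    'c': [5, 4, 1, 4],
    'd': [7, 8, 3, 7],
})

# Named aggregation: each keyword is an output column and each value is
# an (input column, aggregation) pair. Keyword order sets column order.
result = (
    df.groupby('a')
      .agg(
          b=('b', 'first'),
          c=('c', 'mean'),
          d=('d', 'first'),
          e=('c', 'std'),  # sample std (ddof=1); NaN for single-row groups
      )
      .reset_index()       # turn the group key 'a' back into a column
)
print(result)
```

The `reset_index()` call plays the same role as `as_index=False` in the answer: it keeps 'a' as an ordinary column rather than the index.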

Pandas computes the sample standard deviation (ddof=1) by default, which is why groups with a single row give NaN. To compute the population std (ddof=0):

def pop_std(x):
    return x.std(ddof=0)

result = df.groupby(['a'], as_index=False).agg({'c':['mean',pop_std],'b':'first', 'd':'first'})

result.columns = ['a','c','e','b','d']
result.reindex(columns=sorted(result.columns))

yields

        a  b    c  d    e
0   Apple  3  4.5  7  0.5
1  Banana  4  4.0  8  0.0
2  Cherry  7  1.0  3  0.0
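Note that the desired output in the question shows the sample std for Apple (0.707107) but 0 for the single-row groups, so neither result above matches it exactly. One possible way to get that table, sketched here under the assumption that reporting 0 for a one-row group is acceptable, is to keep the default sample std and replace the resulting NaN values with 0. The named-aggregation syntax assumes pandas 0.25 or later.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'a': ['Apple', 'Banana', 'Cherry', 'Apple'],
    'b': [3, 4, 7, 3],
    'c': [5, 4, 1, 4],
    'd': [7, 8, 3, 7],
})

result = (
    df.groupby('a')
      .agg(b=('b', 'first'), c=('c', 'mean'), d=('d', 'first'), e=('c', 'std'))
      .reset_index()
)

# The sample std is undefined (NaN) for a single observation;
# this sketch treats it as 0 to mirror the table in the question.
result['e'] = result['e'].fillna(0)
print(result)
```

Whether 0 is the right value for a single-row group depends on how column 'e' is used later, since NaN and 0 carry different meanings.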

unutbu answered Sep 19 '22 14:09