I was trying to write a solution for this question by providing a different, manual way to calculate the mean and the standard deviation (std).
I created the dataframe as described in the question:
a= ["Apple","Banana","Cherry","Apple"]
b= [3,4,7,3]
c= [5,4,1,4]
d= [7,8,3,7]
import pandas as pd
df = pd.DataFrame(index=range(4), columns=list("ABCD"))
df["A"]=a
df["B"]=b
df["C"]=c
df["D"]=d
Then I created a list of the unique values in A, looped over them, selected the matching rows for each value, and calculated the mean and std:
import numpy as np
l = list(set(df.A))
df.groupby('A', as_index=False)
listMean = [0]*len(df.C)
listSTD = [0]*len(df.C)
for x in l:
    s = np.mean(df[df['A']==x].C.values)
    z = [index for index, item in enumerate(df['A'].values) if x==item]
    for i in z:
        listMean[i] = s
for x in l:
    s = np.std(df[df['A']==x].C.values)
    z = [index for index, item in enumerate(df['A'].values) if x==item]
    for i in z:
        listSTD[i] = s
df['C'] = listMean
df['E'] = listSTD
print(df)
I used describe() grouped by "A" to calculate the mean and std:
print(df.groupby('A').describe())
And tested the suggested solution:
result = df.groupby(['A'], as_index=False).agg(
    {'C':['mean','std'],'B':'first', 'D':'first'})
I noticed that I get different results for the std ("E"). I am just curious: what did I miss?
There are two kinds of standard deviations (SD): the population SD and the sample SD.
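The population SD

$$ s_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} $$

is used when the values represent the entire universe of values that you are studying.
The sample SD

$$ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} $$

is used when the values are a mere sample from that universe. In both formulas, N is the number of values and \bar{x} is their mean.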
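As a rough sketch of the two definitions in plain Python (the function names population_sd and sample_sd are made up for illustration and are not part of any library):

```python
import math

def population_sd(values):
    # Hypothetical helper: divide the sum of squared deviations by N.
    n = len(values)
    mean = sum(values) / float(n)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)

def sample_sd(values):
    # Hypothetical helper: divide by N - 1 instead (needs at least two values).
    n = len(values)
    mean = sum(values) / float(n)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

print(population_sd([4, 5]))  # 0.5
print(sample_sd([4, 5]))      # about 0.7071
```

The only difference between the two functions is the divisor, which is why the results drift apart most for small groups.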
np.std calculates the population SD by default, while Pandas' Series.std calculates the sample SD by default.
In [42]: np.std([4,5])
Out[42]: 0.5
In [43]: np.std([4,5], ddof=0)
Out[43]: 0.5
In [44]: np.std([4,5], ddof=1)
Out[44]: 0.70710678118654757
In [45]: x = pd.Series([4,5])
In [46]: x.std()
Out[46]: 0.70710678118654757
In [47]: x.std(ddof=0)
Out[47]: 0.5
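Applied to the data from the question, the same difference shows up per group. This is only a sketch: I rebuild just columns A and C here, and the variable names sample, population and manual are my own.

```python
import numpy as np
import pandas as pd

# Columns A and C from the question's dataframe.
df = pd.DataFrame({
    "A": ["Apple", "Banana", "Cherry", "Apple"],
    "C": [5, 4, 1, 4],
})

# pandas divides by N - 1 unless told otherwise.
sample = df.groupby("A")["C"].std()
# Passing ddof=0 makes pandas divide by N, like np.std does by default.
population = df.groupby("A")["C"].std(ddof=0)
# The manual approach from the question, calling np.std on each group.
manual = df.groupby("A")["C"].apply(lambda s: np.std(s.values))

print(sample)
print(population)
print(manual)
```

With ddof=0 the grouped pandas result should match the manual np.std values. Note that the groups with a single row (Banana, Cherry) give NaN for the sample SD, because the divisor N - 1 is zero, while the population SD for them is 0.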
ddof stands for "delta degrees of freedom". It is the number subtracted from N in the SD formulas, so the divisor is N - ddof: ddof=0 gives the population SD and ddof=1 gives the sample SD.
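To illustrate how ddof changes the divisor, here is a small sketch that recomputes the SD by hand and compares it with np.std (sd_with_ddof is a hypothetical helper written for this example, not a NumPy function):

```python
import numpy as np

def sd_with_ddof(values, ddof=0):
    # Hypothetical helper: divide the squared deviations by N - ddof.
    values = np.asarray(values, dtype=float)
    n = values.size
    return np.sqrt(((values - values.mean()) ** 2).sum() / (n - ddof))

# Each line prints the hand-computed value next to NumPy's result.
print(sd_with_ddof([4, 5], ddof=0), np.std([4, 5], ddof=0))
print(sd_with_ddof([4, 5], ddof=1), np.std([4, 5], ddof=1))
```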
The formulas above come from this Wikipedia page. There, the "uncorrected sample standard deviation" is what I (and others) call the population SD, and the "corrected sample standard deviation" is the sample SD.