Calculate STD manually using Groupby Pandas DataFrame

Question

I was trying to answer this question by providing a different, manual way to calculate the mean and std.

I created the dataframe as described in the question

a= ["Apple","Banana","Cherry","Apple"]
b= [3,4,7,3]
c= [5,4,1,4]
d= [7,8,3,7]

import pandas as pd
df =  pd.DataFrame(index=range(4), columns=list("ABCD"))

df["A"]=a
df["B"]=b
df["C"]=c
df["D"]=d

Then I created a list of the unique values in A. I looped over those values, selected the matching rows each time, and calculated the mean and std of C for that group.

import numpy as np

l= list(set(df.A))

df.groupby('A', as_index=False)
listMean=[0]*len(df.C)
listSTD=[0]*len(df.C)

for x in l:
    s= np.mean(df[df['A']==x].C.values)
    z= [index for index, item in enumerate(df['A'].values) if x==item ]
    for i in z:
        listMean[i]=s

for x in l:
    s=  np.std(df[df['A']==x].C.values)
    z= [index for index, item in enumerate(df['A'].values) if x==item ]
    for i in z:
        listSTD[i]=s

df['C']= listMean
df['E']= listSTD

print(df)

I then used describe(), grouped by "A", to calculate the mean and std.

print(df.groupby('A').describe())

And tested the suggested solution:

result = df.groupby(['A'], as_index=False).agg(
                      {'C':['mean','std'],'B':'first', 'D':'first'})

I noticed that my manually calculated std (column "E") differs from the std in these results. I am just curious: what did I miss?

unutbu · Accepted Answer

There are two kinds of standard deviations (SD): the population SD and the sample SD.

The population SD

$$ s_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} $$

is used when the values represent the entire universe of values that you are studying.

The sample SD

$$ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} $$

is used when the values are a mere sample from that universe.

np.std calculates the population SD by default, while Pandas' Series.std calculates the sample SD by default.

In [42]: np.std([4,5])
Out[42]: 0.5

In [43]: np.std([4,5], ddof=0)
Out[43]: 0.5

In [44]: np.std([4,5], ddof=1)
Out[44]: 0.70710678118654757

In [45]: x = pd.Series([4,5])

In [46]: x.std()
Out[46]: 0.70710678118654757

In [47]: x.std(ddof=0)
Out[47]: 0.5

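ddof stands for "delta degrees of freedom". It is the number subtracted from N in the divisor of the SD formulas: the divisor is N - ddof, so ddof=0 gives the population SD and ddof=1 gives the sample SD.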
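To make the role of ddof concrete, here is a minimal sketch that computes the SD by hand with divisor N - ddof and compares it with np.std and Series.std. The helper name manual_std is hypothetical; it is not part of NumPy or pandas.

```python
import math

import numpy as np
import pandas as pd


def manual_std(values, ddof=0):
    """Standard deviation with divisor N - ddof (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))


data = [4, 5]

population_sd = manual_std(data, ddof=0)  # divisor N, like np.std's default
sample_sd = manual_std(data, ddof=1)      # divisor N - 1, like Series.std's default

print(population_sd, np.std(data))
print(sample_sd, pd.Series(data).std())
```

Each printed pair should agree up to floating-point rounding.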

The formulas above come from this Wikipedia page. There, the "uncorrected sample standard deviation" is what I (and others) call the population SD, and the "corrected sample standard deviation" is the sample SD.
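Applied to the DataFrame in the question, this would explain the mismatch: the manual loop uses np.std (ddof=0), while describe() and agg('std') use the pandas default (ddof=1). The following is only a sketch, assuming the original column values from the question; the column names E_population and E_sample are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["Apple", "Banana", "Cherry", "Apple"],
    "B": [3, 4, 7, 3],
    "C": [5, 4, 1, 4],
    "D": [7, 8, 3, 7],
})

grouped_c = df.groupby("A")["C"]

# Population SD per group (ddof=0), broadcast back to every row of the group.
# This is what the manual np.std loop computes.
df["E_population"] = grouped_c.transform(lambda s: s.std(ddof=0))

# Sample SD per group (ddof=1), the pandas default used by describe() and agg('std').
df["E_sample"] = grouped_c.transform("std")

print(df)
```

Groups with a single row (Banana, Cherry) give 0.0 for the population SD but NaN for the sample SD, because the divisor N - 1 is zero.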

Tags:

python

algorithm

pandas

user3378649

1 Answers

unutbu
