
Calculate STD manually using Groupby Pandas DataFrame

I was trying to answer this question by showing a different, manual way to calculate the mean and the standard deviation (std).

I created the dataframe as described in the question:

a= ["Apple","Banana","Cherry","Apple"]
b= [3,4,7,3]
c= [5,4,1,4]
d= [7,8,3,7]

import pandas as pd
df =  pd.DataFrame(index=range(4), columns=list("ABCD"))

df["A"]=a
df["B"]=b
df["C"]=c
df["D"]=d

Then I built a list of the unique values in A and looped over it, selecting the matching rows each time and computing the statistic for that group.

import numpy as np

l= list(set(df.A))

# One slot per row, to be filled with that row's group mean and std.
listMean=[0]*len(df.C)
listSTD=[0]*len(df.C)

for x in l:
    s= np.mean(df[df['A']==x].C.values)
    z= [index for index, item in enumerate(df['A'].values) if x==item ]
    for i in z:
        listMean[i]=s

for x in l:
    s=  np.std(df[df['A']==x].C.values)
    z= [index for index, item in enumerate(df['A'].values) if x==item ]
    for i in z:
        listSTD[i]=s

df['C']= listMean
df['E']= listSTD

print(df)

I used describe() grouped by "A" to calculate the mean and std:

print(df.groupby('A').describe())

And tested the suggested solution:

result = df.groupby(['A'], as_index=False).agg(
                      {'C':['mean','std'],'B':'first', 'D':'first'})

I noticed that I get different results for the std (column "E"). I am just curious: what did I miss?

user3378649 asked Apr 28 '26 16:04


1 Answer

There are two kinds of standard deviation (SD): the population SD and the sample SD.

The population SD

$$ s_N = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} $$

is used when the values represent the entire universe of values that you are studying.

The sample SD

$$ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} $$

is used when the values are a mere sample from that universe.

np.std calculates the population SD by default, while Pandas' Series.std calculates the sample SD by default.

In [42]: np.std([4,5])
Out[42]: 0.5

In [43]: np.std([4,5], ddof=0)
Out[43]: 0.5

In [44]: np.std([4,5], ddof=1)
Out[44]: 0.70710678118654757

In [45]: x = pd.Series([4,5])

In [46]: x.std()
Out[46]: 0.70710678118654757

In [47]: x.std(ddof=0)
Out[47]: 0.5

ddof stands for "Delta Degrees of Freedom". The divisor used in the SD formulas is N - ddof, so ddof=0 gives the population SD and ddof=1 gives the sample SD.
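To make the role of ddof concrete, here is a minimal sketch that computes the SD by hand with an N - ddof divisor and prints it next to the library results. The helper name manual_std is hypothetical and exists only for this illustration. It is not part of NumPy or pandas.

```python
import math

import numpy as np
import pandas as pd


def manual_std(values, ddof=0):
    """Standard deviation with divisor N - ddof (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))


data = [4, 5]
population_sd = manual_std(data, ddof=0)  # divisor N
sample_sd = manual_std(data, ddof=1)      # divisor N - 1

# NumPy defaults to ddof=0, pandas to ddof=1.
print(population_sd, np.std(data))
print(sample_sd, pd.Series(data).std())
```

Each printed pair should agree up to floating-point rounding, which is the same contrast the interactive session above shows.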

The formulas above come from this Wikipedia page. There the "uncorrected sample standard deviation" is what I (and others) call the population SD, and the "corrected sample standard deviation" is the sample SD.
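Applied to the data from the question, the difference can be sketched as follows. The DataFrame is rebuilt here from the lists in the question so the snippet runs on its own, and the variable names (grouped, sample_sd, population_sd, manual) are chosen just for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["Apple", "Banana", "Cherry", "Apple"],
    "B": [3, 4, 7, 3],
    "C": [5, 4, 1, 4],
    "D": [7, 8, 3, 7],
})

grouped = df.groupby("A")["C"]

sample_sd = grouped.std()            # pandas default, ddof=1
population_sd = grouped.std(ddof=0)  # same divisor as np.std's default

# The manual loop in the question used np.std, i.e. the population SD.
manual = {name: np.std(group.values) for name, group in grouped}

print(sample_sd)
print(population_sd)
print(manual)
```

Passing ddof=0 to the pandas call should reproduce the manual np.std values. Groups with a single row (Banana and Cherry here) give NaN under the sample SD because the divisor N - 1 is zero, while the population SD for them is 0.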

unutbu answered May 01 '26 07:05
