Pandas groupby agg std NaN

Inputs:

(df['PopEst']
    .astype('float')
    .groupby(ContinentDict)
    .agg(['size', 'sum', 'mean', 'std']))

Outputs:

            size            sum                mean              std
Asia          5     2.898666e+09       5.797333e+08     6.790979e+08
Australia     1     2.331602e+07       2.331602e+07              NaN
Europe        6     4.579297e+08       7.632161e+07     3.464767e+07
North America 2     3.528552e+08       1.764276e+08     1.996696e+08
South America 1     2.059153e+08       2.059153e+08              NaN

The std column is NaN for every group that has only one row, but I expected those values to be 0. Why is that?

Alex J asked May 12 '18 13:05


1 Answer

pd.DataFrame.std uses ddof=1 (delta degrees of freedom) by default, which gives the sample standard deviation. The divisor is N - ddof, so a group with a single value divides by zero and the result is NaN.

numpy.std, by contrast, uses ddof=0 by default, which gives the population standard deviation. This returns 0 for a group with a single value.

To understand the difference between sample and population, see Bessel's correction.
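
The difference comes down to the divisor, N - ddof. Here is a minimal sketch, assuming some made-up numbers (the names `values` and `single` are illustrative and not from the question):

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 6.0])   # hypothetical data
single = pd.Series([5.0])             # stands in for a one-row group

# Sample standard deviation (ddof=1): squared deviations divided by N - 1.
sample_std = values.std()                    # pandas default
# Population standard deviation (ddof=0): squared deviations divided by N.
population_std = np.std(values.to_numpy())   # NumPy default

# With a single value, N - 1 == 0, so the sample estimate is undefined.
single_sample = single.std()                 # NaN
single_population = single.std(ddof=0)       # 0.0

print(sample_std, population_std, single_sample, single_population)
```

The same `ddof` argument is accepted by both libraries, so either behaviour can be requested explicitly instead of relying on the default.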

Therefore, you can use numpy.std for your calculation. Note, however, that the results will also differ for groups with more than one row, because the calculation itself is different. Here's a minimal example.

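import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 9, (5, 2)))

# Wrap np.std in a named function: some pandas versions substitute their own
# std (ddof=1) when np.std itself is passed to agg.
def std(x):
    return np.std(x)

res = df.groupby(0)[1].agg(['size', 'sum', 'mean', std])

print(res)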

   size  sum  mean       std
0                           
0     2   13   6.5       0.5
4     1    3   3.0       0.0
5     1    3   3.0       0.0
6     1    3   3.0       0.0

Alternatively, if you want to keep the sample standard deviation (ddof=1), you can use fillna to replace the NaN values with 0:

res = df.groupby(0)[1].agg(['size', 'sum', 'mean', 'std']).fillna(0)
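
Note that calling fillna(0) on the whole result replaces every NaN, including any that come from missing data rather than from one-row groups. A more targeted sketch, assuming a small made-up frame (the column names `continent` and `pop` are hypothetical), zeroes the std only where the group size is 1, and also shows that passing ddof=0 to the groupby std is another way to get the population value:

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the asker's dataset.
df = pd.DataFrame({
    'continent': ['Asia', 'Asia', 'Australia', 'Europe', 'Europe'],
    'pop': [10.0, 30.0, 5.0, 2.0, 4.0],
})

grouped = df.groupby('continent')['pop']
res = grouped.agg(['size', 'sum', 'mean', 'std'])

# Keep the sample std (ddof=1), but set it to 0 only for one-row groups.
res['std_filled'] = res['std'].mask(res['size'] == 1, 0.0)

# Population std (ddof=0) for comparison; one-row groups give 0 directly.
res['std_pop'] = grouped.std(ddof=0)

print(res)
```

Which of the two columns is appropriate depends on whether the rows are treated as a sample or as the whole population.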

jpp answered Sep 20 '22 14:09