
Python - Calculating standard deviation (row level) of dataframe columns

I have created a pandas DataFrame and can compute the standard deviation of one or more of its columns (column level). What I need is the standard deviation for all the rows of one particular column. Below are the commands that I have tried so far:

# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()

salary         8.194421e-01
num_months     3.690081e+05
no_of_hours    2.518869e+02

# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)

# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()

salary         8.194421e-01

# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)

0       4.374107e+12
1       4.377543e+12
2       4.374026e+12
3       4.374046e+12
4       4.374112e+12
5       4.373926e+12

When I execute the command below, I get "NaN" for every record. Is there a way to resolve this?

# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN

JKC asked Dec 17 '18 05:12


1 Answer

This is expected. The documentation for DataFrame.std says:

Normalized by N-1 by default. This can be changed using the ddof argument

With only one element, N - 1 is 0, so the calculation divides by zero. If you select a single column and ask for the sample standard deviation across columns (axis=1), each row contains just one value, and every result is a missing value.
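The normalization can be sketched in plain Python. This is only an illustration of the N - ddof formula, not pandas' actual implementation; the helper name `std` and its handling of the zero-denominator case are assumptions made for the example.

```python
import math

def std(values, ddof=1):
    """Standard deviation normalized by N - ddof (hypothetical helper)."""
    n = len(values)
    if n - ddof <= 0:
        # Nothing to divide by, so report a missing value.
        return math.nan
    mean = sum(values) / n
    squared_diffs = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_diffs / (n - ddof))

print(std([10]))           # nan: one value, so N - 1 == 0
print(std([10], ddof=0))   # 0.0
print(std([10, 20, 30]))   # 10.0
```

A one-column row is the `std([10])` case: there is a single value, so the default normalization has nothing to divide by.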

Sample:

import pandas as pd

inp_df = pd.DataFrame({'salary':[10,20,30],
                       'num_months':[1,2,3],
                       'no_of_hours':[2,5,6]})
print (inp_df)
   salary  num_months  no_of_hours
0      10           1            2
1      20           2            5
2      30           3            6

Selecting one column with single brackets [] returns a Series:

print (inp_df['salary'])
0    10
1    20
2    30
Name: salary, dtype: int64

The std of a Series is a scalar:

print (inp_df['salary'].std())
10.0

Selecting one column with double brackets [[]] returns a one-column DataFrame:

print (inp_df[['salary']])
   salary
0      10
1      20
2      30

The std of that DataFrame along the index (axis=0, the default) is a one-element Series:

print (inp_df[['salary']].std())
# same as
# print (inp_df[['salary']].std(axis=0))
salary    10.0
dtype: float64

The std of that DataFrame across columns (axis=1) is all NaNs, because each row holds a single value:

print (inp_df[['salary']].std(axis = 1))
0   NaN
1   NaN
2   NaN
dtype: float64

If you change the default ddof=1 to ddof=0, the result is the population standard deviation, which is 0 for a single value:

print (inp_df[['salary']].std(axis = 1, ddof=0))
0    0.0
1    0.0
2    0.0
dtype: float64
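The same default matters when comparing pandas with NumPy. The sketch below assumes NumPy is installed alongside pandas; the variable names are illustrative only. pandas normalizes by N - 1 by default, while `np.std` normalizes by N unless `ddof` is passed.

```python
import numpy as np
import pandas as pd

salary = pd.DataFrame({'salary': [10, 20, 30]})
values = salary['salary'].to_numpy()

# pandas: sample standard deviation (ddof=1) by default
pandas_default = salary['salary'].std()

# NumPy: population standard deviation (ddof=0) by default
numpy_default = np.std(values)

# Passing ddof explicitly makes the two agree
numpy_sample = np.std(values, ddof=1)

print(pandas_default)  # 10.0
print(numpy_default)   # roughly 8.165
print(numpy_sample)    # 10.0
```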

If you want the std over two or more columns:

# select 2 columns
print (inp_df[['salary', 'num_months']])
   salary  num_months
0      10           1
1      20           2
2      30           3

# std along the index: one value per column
print (inp_df[['salary','num_months']].std())
salary        10.0
num_months     1.0
dtype: float64

# std across columns: one value per row (using salary and no_of_hours here)
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0     5.656854
1    10.606602
2    16.970563
dtype: float64

jezrael answered Sep 30 '22 19:09