I have created a pandas DataFrame and can compute the standard deviation of one or more of its columns (column level). I need to determine the standard deviation for all the rows of a particular column. Below are the commands that I have tried so far:
# Will determine the standard deviation of all the numerical columns by default.
inp_df.std()
salary 8.194421e-01
num_months 3.690081e+05
no_of_hours 2.518869e+02
# Same as above command. Performs the standard deviation at the column level.
inp_df.std(axis = 0)
# Determines the standard deviation over only the salary column of the dataframe.
inp_df[['salary']].std()
salary 8.194421e-01
# Determines Standard Deviation for every row present in the dataframe. But it
# does this for the entire row and it will output values in a single column.
# One std value for each row.
inp_df.std(axis=1)
0 4.374107e+12
1 4.377543e+12
2 4.374026e+12
3 4.374046e+12
4 4.374112e+12
5 4.373926e+12
When I execute the command below, I get NaN for every record. Is there a way to resolve this?
# Trying to determine standard deviation only for the "salary" column at the
# row level.
inp_df[['salary']].std(axis = 1)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
You can use the DataFrame.std() function to calculate the standard deviation of values in a pandas DataFrame. Note that std() ignores NaN values by default (skipna=True) when calculating the standard deviation.
We can get the standard deviation of a DataFrame along rows or columns by using std(). The axis argument selects the direction: axis=0 (the default) returns one value per column, and axis=1 returns one value per row.
Sometimes you need the standard deviation of one specific numeric column. In that case, select the column from the DataFrame by its label and call std() on the result using the dot operator.
Standard deviation measures how far data values are dispersed from the mean. The population standard deviation is the square root of the sum of squared differences from the mean divided by the size of the data set, N. pandas computes the sample standard deviation by default, which divides by N-1 instead.
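As a rough check of the formula, the sketch below computes the standard deviation by hand for a small series and compares it with what pandas returns. The values are illustrative only, and the variable names are my own.

```python
import math
import pandas as pd

# Illustrative values (assumed for this sketch).
values = pd.Series([10, 20, 30])

mean = values.mean()
squared_diffs = ((values - mean) ** 2).sum()
n = len(values)

# Population form: divide by N.
population_std = math.sqrt(squared_diffs / n)
# Sample form: divide by N - 1, which is what pandas does by default (ddof=1).
sample_std = math.sqrt(squared_diffs / (n - 1))

print(population_std, values.std(ddof=0))
print(sample_std, values.std())
```

The two printed pairs should agree, which shows that ddof only changes the divisor.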
This is expected. The documentation for DataFrame.std says:
Normalized by N-1 by default. This can be changed using the ddof argument.
With a single element, N-1 is 0, so the calculation divides by zero. That is why a one-column DataFrame returns only missing values when you ask for the sample standard deviation across columns (axis=1).
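To make the division by zero concrete, here is a minimal pure-Python sketch of a standard deviation normalized by N - ddof. The helper std_with_ddof is hypothetical and written only for illustration; it is not how pandas implements std internally.

```python
import math

def std_with_ddof(values, ddof=1):
    """Hypothetical helper: standard deviation normalized by N - ddof."""
    n = len(values)
    denominator = n - ddof
    if denominator <= 0:
        # Nothing to divide by, so report a missing value,
        # as pandas does for a single element with ddof=1.
        return math.nan
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / denominator)

print(std_with_ddof([10]))          # one value, ddof=1: denominator is 0
print(std_with_ddof([10], ddof=0))  # one value, ddof=0: denominator is 1
print(std_with_ddof([10, 2]))       # two values, ddof=1: denominator is 1
```

Each row of a one-column DataFrame holds exactly one value, so the first call mirrors what happens per row with axis=1.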
Sample:
import pandas as pd

inp_df = pd.DataFrame({'salary': [10, 20, 30],
                       'num_months': [1, 2, 3],
                       'no_of_hours': [2, 5, 6]})
print (inp_df)
salary num_months no_of_hours
0 10 1 2
1 20 2 5
2 30 3 6
Select one column with single brackets [] to get a Series:
print (inp_df['salary'])
0 10
1 20
2 30
Name: salary, dtype: int64
Get std of the Series; the result is a scalar:
print (inp_df['salary'].std())
10.0
Select one column with double brackets [[]] to get a one-column DataFrame:
print (inp_df[['salary']])
salary
0 10
1 20
2 30
Get std of the DataFrame along the index (axis=0, the default); the result is a one-element Series:
print (inp_df[['salary']].std())
# same as
#print (inp_df[['salary']].std(axis=0))
salary 10.0
dtype: float64
Get std of the DataFrame across columns (axis=1); the result is all NaNs:
print (inp_df[['salary']].std(axis = 1))
0 NaN
1 NaN
2 NaN
dtype: float64
If the default ddof=1 is changed to ddof=0:
print (inp_df[['salary']].std(axis = 1, ddof=0))
0 0.0
1 0.0
2 0.0
dtype: float64
If you want std over two or more columns:
#select 2 columns
print (inp_df[['salary', 'num_months']])
salary num_months
0 10 1
1 20 2
2 30 3
#std by index
print (inp_df[['salary','num_months']].std())
salary 10.0
num_months 1.0
dtype: float64
#std by columns (axis=1), here for salary and no_of_hours
print (inp_df[['salary','no_of_hours']].std(axis = 1))
0 5.656854
1 10.606602
2 16.970563
dtype: float64
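The row-wise values can be checked by hand. With exactly two values a and b per row and the default ddof=1, the sample standard deviation reduces to |a - b| / sqrt(2). The sketch below rebuilds only the two columns used in the last command and compares both results; by_hand is a name introduced here for illustration.

```python
import math
import pandas as pd

# Same values as the sample above, limited to the two columns used.
inp_df = pd.DataFrame({'salary': [10, 20, 30],
                       'no_of_hours': [2, 5, 6]})

# Row-wise sample standard deviation as computed by pandas.
row_std = inp_df[['salary', 'no_of_hours']].std(axis=1)

# Closed form for two values per row: |a - b| / sqrt(2).
by_hand = (inp_df['salary'] - inp_df['no_of_hours']).abs() / math.sqrt(2)

print(row_std.tolist())
print(by_hand.tolist())
```

For the first row this gives |10 - 2| / sqrt(2), which is roughly 5.656854 and matches the output above.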