I am new to pandas/Python. I would like to know how the .corr function removes null data from a DataFrame with multiple variables when computing the correlation.
For example, let's suppose I have the following dataframe:
#   'A1'  'A2'  'A3'
1     4     3     1
2     2     5    NA
3     3     2    NA
4    NA    10     2
1) Does it remove the entire row in which there is at least one NA/null value? (in this case, only the first row would be considered to compute the correlation matrix)
OR
2) Does it compute pairwise correlation, only excluding individual values? (e.g. for the correlation between 'A1' and 'A2' it uses rows 1, 2 and 3; and for the correlation between 'A1' and 'A3' it uses only row 1, the only row where both are non-null.)
I haven't found this information in the .corr documentation; it only says that it removes null values. Sorry if it is a silly question. I would be happy to learn where I can find this kind of detailed information about functions.
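One way to tell the two behaviours apart empirically would be something like the following sketch, where the NA entries are represented as np.nan:

import numpy as np
import pandas as pd

# Rebuild the example frame; np.nan marks the missing (NA) entries
df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})

# Hypothesis 1: listwise deletion -- drop every row containing any NaN first
# (here only the first row would survive dropna())
print(df.dropna().corr())

# Hypothesis 2: pairwise deletion -- let corr() handle the NaNs itself
print(df.corr())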
corr() computes the pairwise correlation of all columns of a pandas DataFrame. Any NaN values are automatically excluded, and any non-numeric columns are ignored.
When computing each pairwise correlation, pandas ignores observations that have a NaN in either column. We can verify that by removing those values and checking the results.
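For reference, the frame displayed below could be rebuilt like this (a sketch; np.nan stands for the missing entries):

import numpy as np
import pandas as pd

# Example frame from the question; np.nan marks the missing values
df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})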
df
Out[8]:
    A1  A2   A3
0  4.0   3  1.0
1  2.0   5  NaN
2  3.0   2  NaN
3  NaN  10  2.0
With the following correlation results:
df.corr()
Out[9]:
          A1        A2   A3
A1  1.000000 -0.654654  NaN
A2 -0.654654  1.000000  1.0
A3       NaN  1.000000  1.0
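The NaN in the A1/A3 cell appears because only the first row has both A1 and A3 populated, so there is nothing to correlate, and the A2/A3 value of exactly 1.0 rests on just two overlapping rows. If you want such thinly supported entries reported as NaN instead, corr() accepts a min_periods argument; a minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})

# Require at least 3 overlapping (non-NaN) observations per pair of columns;
# pairs with fewer overlapping rows are reported as NaN
print(df.corr(min_periods=3))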
Now if we remove the NaN from column A1, we can check that the result is the same:
df[pd.isnull(df['A1'])==False].corr()
Out[10]:
          A1        A2   A3
A1  1.000000 -0.654654  NaN
A2 -0.654654  1.000000  NaN
A3       NaN       NaN  NaN
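The boolean filter above can also be written with notna() or dropna(subset=...); a sketch of equivalent spellings, assuming a pandas version recent enough to have notna():

import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})

# Both are equivalent to df[pd.isnull(df['A1']) == False].corr()
print(df[df['A1'].notna()].corr())
print(df.dropna(subset=['A1']).corr())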
Similarly for A3:
df[pd.isnull(df['A3'])==False].corr()
     A1   A2   A3
A1  NaN  NaN  NaN
A2  NaN  1.0  1.0
A3  NaN  1.0  1.0
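Individual cells of these matrices can also be reproduced with Series.corr, which applies the same pairwise exclusion of NaN; a small sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})

# Series.corr drops rows where either series is NaN before computing
print(df['A1'].corr(df['A2']))   # uses rows 0, 1, 2 -> about -0.6547
print(df['A2'].corr(df['A3']))   # uses rows 0 and 3  -> exactly 1.0
print(df['A1'].corr(df['A3']))   # only row 0 overlaps -> NaN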
Edit
Just to complement the answer a bit, and referring back to this answer, you can see that pandas will ignore NaN values in the calculations, whereas numpy's np.corrcoef will not (note that np.corrcoef treats each row as a variable by default, so the call below correlates rows rather than columns, but the NaN propagation is the same either way):
np.corrcoef(df.values)
Out[12]:
array([[ 1., nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])
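To make plain numpy mimic pandas' behaviour, the NaNs have to be dropped per pair of columns by hand; a rough sketch of that idea (not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})

cols = df.columns
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        # Keep only the rows where both columns are non-NaN (pairwise deletion)
        mask = df[a].notna() & df[b].notna()
        if mask.sum() >= 2:
            r = np.corrcoef(df.loc[mask, a], df.loc[mask, b])[0, 1]
        else:
            r = np.nan  # fewer than 2 overlapping points -> correlation undefined
        print(a, b, r)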