Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does .corr remove NA and null values?

Tags:

python

pandas

I am new to pandas/python. I would like to know how the function .corr remove the null data of a dataframe with multiple variables when computing the correlation.

For example, let's suppose I have the following dataframe:

  #  'A1'  'A2' 'A3'
  1   4     3    1
  2   2     5    NA
  3   3     2    NA
  4   NA    10   2

1) Does it remove the entire row in which there is at least one NA/null value? (in this case, only the first row would be considered to compute the correlation matrix)

OR

2) Does it compute pairwise correlation, only excluding individual values? (e.g. for correlation between 'A1' and 'A2', it computes rows 1, 2 and 3; and for correlation between 'A1' and 'A3', it computes row 1 and 4.)

I haven't found such information in the function .corr documentation. It only says it removes the null values. Sorry if it is a silly question. I would be happy to learn where I can find this kind of detailed information regarding functions.

like image 471
Rafaela V Avatar asked Jul 23 '19 01:07

Rafaela V


People also ask

How does .corr handle NaN?

corr() is used to find the pairwise correlation of all columns in the Pandas Dataframe in Python. Any NaN values are automatically excluded. Any non-numeric data type or columns in the Dataframe, it is ignored.

How does pandas Corr deal with NaN?

Pandas will ignore the pairwise correlation if it has NaN value in one of the observations. We can verify that by removing the those values and checking the results.

How does Python deal with Na?

The possible ways to do this are: Filling the missing data with the mean or median value if it's a numerical variable. Filling the missing data with mode if it's a categorical value. Filling the numerical value with 0 or -999, or some other number that will not occur in the data.


1 Answers

Pandas will ignore the pairwise correlation if it has NaN value in one of the observations. We can verify that by removing the those values and checking the results.

df

Out[8]: 
    A1  A2   A3
0  4.0   3  1.0
1  2.0   5  NaN
2  3.0   2  NaN
3  NaN  10  2.0

With the following correlation results:

df.corr()

Out[9]: 
          A1        A2   A3
A1  1.000000 -0.654654  NaN
A2 -0.654654  1.000000  1.0
A3       NaN  1.000000  1.0

Now if we remove the NaN from column A1 we can check that the result is the same:

df[pd.isnull(df['A1'])==False].corr()

Out[10]: 
          A1        A2  A3
A1  1.000000 -0.654654 NaN
A2 -0.654654  1.000000 NaN
A3       NaN       NaN NaN

Similarly to A3:

df[pd.isnull(df['A3'])==False].corr()

    A1   A2   A3
A1 NaN  NaN  NaN
A2 NaN  1.0  1.0
A3 NaN  1.0  1.0

Edit

Just to complement a bit the answer, and referring back to this answer, you can see that pandas will ignore NaN values in the calculations whereas numpy np.corrcoef will not:

np.corrcoef(df.values)

Out[12]: 
array([[ 1., nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])
like image 188
calestini Avatar answered Sep 22 '22 16:09

calestini