I am new to pandas/Python. I would like to know how the .corr function removes null data from a DataFrame with multiple variables when computing the correlation.
For example, let's suppose I have the following dataframe:
#   'A1'  'A2'  'A3'
1     4     3     1
2     2     5    NA
3     3     2    NA
4    NA    10     2
1) Does it remove the entire row in which there is at least one NA/null value? (in this case, only the first row would be considered to compute the correlation matrix)
OR
2) Does it compute pairwise correlation, only excluding individual values? (e.g. for the correlation between 'A1' and 'A2' it uses rows 1, 2 and 3; and for the correlation between 'A1' and 'A3' it uses only row 1, the only row where both are non-null.)
I haven't found this information in the .corr documentation; it only says that it removes null values. Sorry if it is a silly question. I would be happy to learn where I can find this kind of detailed information about functions.
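One way to tell the two behaviours apart empirically would be something like the following sketch, where the NA entries are represented as np.nan:

import numpy as np
import pandas as pd

# Rebuild the example frame; np.nan marks the missing (NA) entries
df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})

# Hypothesis 1: listwise deletion -- drop every row containing any NaN first
# (here only the first row would survive dropna())
print(df.dropna().corr())

# Hypothesis 2: pairwise deletion -- let corr() handle the NaNs itself
print(df.corr())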
corr() computes the pairwise correlation of all columns of a pandas DataFrame. Any NaN values are automatically excluded, and any non-numeric columns are ignored.
When computing each pairwise correlation, pandas ignores observations that have a NaN in either column. We can verify that by removing those values and checking the results.
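For reference, the frame displayed below could be rebuilt like this (a sketch; np.nan stands for the missing entries):

import numpy as np
import pandas as pd

# Example frame from the question; np.nan marks the missing values
df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})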
df
Out[8]:
    A1  A2   A3
0  4.0   3  1.0
1  2.0   5  NaN
2  3.0   2  NaN
3  NaN  10  2.0
With the following correlation results:
df.corr()
Out[9]:
          A1        A2   A3
A1  1.000000 -0.654654  NaN
A2 -0.654654  1.000000  1.0
A3       NaN  1.000000  1.0
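The NaN in the A1/A3 cell appears because only the first row has both A1 and A3 populated, so there is nothing to correlate, and the A2/A3 value of exactly 1.0 rests on just two overlapping rows. If you want such thinly supported entries reported as NaN instead, corr() accepts a min_periods argument; a minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})

# Require at least 3 overlapping (non-NaN) observations per pair of columns;
# pairs with fewer overlapping rows are reported as NaN
print(df.corr(min_periods=3))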
Now if we remove the NaN from column A1, we can check that the result is the same:
df[pd.isnull(df['A1'])==False].corr()
Out[10]:
          A1        A2   A3
A1  1.000000 -0.654654  NaN
A2 -0.654654  1.000000  NaN
A3       NaN       NaN  NaN
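The boolean filter above can also be written with notna() or dropna(subset=...); a sketch of equivalent spellings, assuming a pandas version recent enough to have notna():

import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})

# Both are equivalent to df[pd.isnull(df['A1']) == False].corr()
print(df[df['A1'].notna()].corr())
print(df.dropna(subset=['A1']).corr())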
Similarly for A3:
df[pd.isnull(df['A3'])==False].corr()
     A1   A2   A3
A1  NaN  NaN  NaN
A2  NaN  1.0  1.0
A3  NaN  1.0  1.0
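Individual cells of these matrices can also be reproduced with Series.corr, which applies the same pairwise exclusion of NaN; a small sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})

# Series.corr drops rows where either series is NaN before computing
print(df['A1'].corr(df['A2']))   # uses rows 0, 1, 2 -> about -0.6547
print(df['A2'].corr(df['A3']))   # uses rows 0 and 3  -> exactly 1.0
print(df['A1'].corr(df['A3']))   # only row 0 overlaps -> NaN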
Edit
Just to complement the answer a bit, and referring back to this answer, you can see that pandas will ignore NaN values in the calculations, whereas numpy's np.corrcoef will not (note that np.corrcoef treats each row as a variable by default, so the call below correlates rows rather than columns, but the NaN propagation is the same either way):
np.corrcoef(df.values)
Out[12]:
array([[ 1., nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])
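To make plain numpy mimic pandas' behaviour, the NaNs have to be dropped per pair of columns by hand; a rough sketch of that idea (not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': [4, 2, 3, np.nan],
                   'A2': [3, 5, 2, 10],
                   'A3': [1, np.nan, np.nan, 2]})

cols = df.columns
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        # Keep only the rows where both columns are non-NaN (pairwise deletion)
        mask = df[a].notna() & df[b].notna()
        if mask.sum() >= 2:
            r = np.corrcoef(df.loc[mask, a], df.loc[mask, b])[0, 1]
        else:
            r = np.nan  # fewer than 2 overlapping points -> correlation undefined
        print(a, b, r)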