
Python numpy.corrcoef() RuntimeWarning: invalid value encountered in true_divide c /= stddev[:, None]

It seems that numpy's corrcoef() emits a RuntimeWarning when a constant list is passed to it. For example, the code below triggers the warning:

import numpy as np
X = [1.0, 2.0, 3.0, 4.0]
Y = [2, 2, 2, 2]
print(np.corrcoef(X, Y)[0, 1])

Warning:

/usr/local/lib/python3.6/site-packages/numpy/lib/function_base.py:3003: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]

Can anyone explain why it raises this warning when one of the lists is constant, and how to prevent the warning when a constant list is passed to the function?

Abdennacer Lachiheb asked Aug 26 '17 15:08


2 Answers

Correlation is a measure of how well two vectors track with each other as they change. You can't track mutual change when one vector doesn't change.

As noted in the comments on the question, the formula for Pearson's product-moment correlation coefficient divides the covariance of X and Y by the product of their standard deviations. Since Y has zero variance in your example, its standard deviation is also zero, and so is its covariance with X. That's why you get the true_divide warning: numpy ends up computing 0/0, which is an invalid floating-point operation and yields nan.
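To make the 0/0 concrete, here is a minimal sketch that writes the Pearson formula out by hand for the arrays from the question. It is only an illustration of the formula, not numpy's actual implementation of corrcoef(); the variable names cov_xy, std_x, std_y and r are chosen here for the example.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 2.0, 2.0, 2.0])

# Pearson r = cov(X, Y) / (std(X) * std(Y)), written out by hand.
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))
std_x = X.std()
std_y = Y.std()

print(cov_xy)  # zero, because Y never deviates from its mean
print(std_y)   # zero for the same reason

# These are numpy scalars, so 0/0 gives nan plus a RuntimeWarning
# rather than raising ZeroDivisionError. The warning is silenced
# locally here so the raw result can be inspected.
with np.errstate(invalid="ignore", divide="ignore"):
    r = cov_xy / (std_x * std_y)

print(r)  # nan
```

Both the numerator and the denominator are zero, which is why the message says "invalid value" rather than "divide by zero".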

Note: It might seem tempting, from an engineering standpoint, to simply add a very small quantity (say, a value just above machine epsilon) to one of the entries in Y, in order to get around the zero-division issue. But that's not statistically viable. Even adding 1e-15 will seriously distort your correlation coefficient, and the result depends on which value you add it to.

Consider the difference between these two cases:

import numpy as np

X = [1.0, 2.0, 3.0, 4.0]

tiny = 1e-15

# add tiny amount to second element
Y1 = [2., 2. + tiny, 2., 2.]
print(np.corrcoef(X, Y1)[0, 1])
# approximately -0.22360679775

# add tiny amount to fourth element
Y2 = [2., 2., 2., 2. + tiny]
print(np.corrcoef(X, Y2)[0, 1])
# approximately 0.67082039325

This may be obvious to statisticians, but given the nature of the question it seems like a relevant caveat.

andrew_reece answered Sep 19 '22 15:09


This is what I tried: using an if statement, you can check the standard deviation of each series and compute the correlation only when both are greater than zero. In this example there is no valid output because one series has no variation, but the same check can be applied to other cases.

import numpy as np
X = [1.0, 2.0, 3.0, 4.0]
Y = [2, 2, 2, 2]
# Only compute the correlation when both series actually vary
if np.std(X) == 0 or np.std(Y) == 0:
    print('The correlation could not be computed because the standard deviation of one of the series is equal to zero')
else:
    print(np.corrcoef(X, Y)[0, 1])
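
If the check is needed in more than one place, it could be wrapped in a small helper. The sketch below is one possible way to do that, not a numpy feature: the function name safe_pearson is made up for this example, and it assumes that returning nan (without a warning) is an acceptable way to signal "undefined". It tests for a constant series with np.ptp (max minus min), which may be slightly more robust than comparing a computed standard deviation with zero.

```python
import numpy as np

def safe_pearson(x, y):
    """Return Pearson's r, or nan (without a warning) if either input is constant.

    Hypothetical helper for illustration; not part of numpy.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # A constant series has max == min, so its correlation is undefined.
    if np.ptp(x) == 0 or np.ptp(y) == 0:
        return float("nan")
    return float(np.corrcoef(x, y)[0, 1])

print(safe_pearson([1.0, 2.0, 3.0, 4.0], [2, 2, 2, 2]))  # nan
print(safe_pearson([1.0, 2.0, 3.0, 4.0], [2, 4, 6, 8]))  # perfectly linear, so close to 1
```

Alternatively, wrapping the original call in `with np.errstate(invalid="ignore"):` should silence the warning, but the result is still nan, so the caller has to handle that value either way.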
EHadavi answered Sep 18 '22 15:09