Python Sklearn Covariance Matrix diagonal entries incorrect?

I am trying to perform PCA on some data. To my knowledge, a correlation matrix should have entries of 1 along the main diagonal, but that is not what .get_covariance() in sklearn's PCA returns. Why is this the case?
For my own purposes I could simply rescale the matrix to get diagonal entries of 1, but since I have already standardized my data, why are the diagonal entries not 1 in the first place?

In [1]: import pandas as pd

In [2]: import numpy as np                                                                                                                      

In [3]: from sklearn.decomposition import PCA                                                                                                   

In [4]: df = pd.read_csv('myTable.csv')                                                                                                         

In [5]: df                                                                                                                                      
Out[5]:                                                                                                                                         
         a1        a2        a3        a4        a5                                                                                             
0 -0.559104  0.185914 -2.331367  0.231150  0.357008                                                                                             
1  0.769835 -0.408685  0.375754  0.051397 -0.075885                                                                                             
2 -1.376530 -0.764808 -2.383611 -0.327153  1.746765                                                                                             
3 -0.830105 -0.197574  1.835807 -0.695089  0.881297                                                                                             
4 -0.991861  1.089319 -0.164139 -0.335003  0.795937                                                                                             
5 -1.132968 -2.240598 -0.101935  0.680038 -0.033921                                                                                             
6 -1.205631 -1.492009 -0.602400 -0.065256 -0.494267                                                                                             
7 -1.210978 -1.220986 -0.017062  0.024422 -0.224585                                                                                             
8 -0.332957  2.114870  0.818108  0.612831 -1.879758                                                                                             
9 -0.350612 -0.563872  0.869303 -0.325626 -0.372874                                                                                             

In [6]: df = (df-df.mean())/df.std()                                                                                                            

In [7]: pca = PCA()                                                                                                                             

In [8]: pca.fit(df)                                                                                                                             
Out[8]: PCA(copy=True, n_components=None, whiten=False)  

In [10]: pca.explained_variance_, pca.components_, pca.get_covariance()                                                                         
Out[10]:                                                                                                                                        
(array([ 1.8780651 ,  1.1526052 ,  0.78052872,  0.55167761,  0.13712337]),                                                                      
 array([[-0.47790108, -0.36036503, -0.38619941, -0.35716396,  0.60417838],                                                                      
        [ 0.25426743,  0.32305024,  0.47784502, -0.72831952,  0.26870322],                                                                      
        [-0.17613902, -0.7303121 ,  0.6250759 , -0.05118019, -0.20562097],                                                                      
        [ 0.82132736, -0.45982165, -0.21938834,  0.03274499,  0.25452296],                                                                      
        [ 0.03681087, -0.14485808, -0.42855924, -0.58162955, -0.67505936]]),                                                                    
 array([[ 0.9       ,  0.30943895,  0.29916112,  0.12605405, -0.32333097],                                                                      
        [ 0.30943895,  0.9       ,  0.14715469,  0.00295615, -0.24279645],                                                                      
        [ 0.29916112,  0.14715469,  0.9       , -0.13683409, -0.38167791],                                                                      
        [ 0.12605405,  0.00295615, -0.13683409,  0.9       , -0.56418468],                                                                      
        [-0.32333097, -0.24279645, -0.38167791, -0.56418468,  0.9       ]]))   
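For comparison, the correlation matrix computed directly does have ones on its diagonal regardless of which ddof was used in the scaling above; a minimal sanity check on the same df:

# df.corr() normalizes internally, so its diagonal is always 1.0
# no matter how the columns were scaled beforehand.
print(df.corr())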

Closed
The problem was with my standardization: I should have used df.std(ddof=0), as suggested by Tonechas.

asked Dec 18 '25 by AsheKetchum

2 Answers

You need to normalize the standard deviation by N rather than by N-1 (pandas' default). With N = 10 samples, standardizing by the N-1 estimate scales every entry of the covariance matrix by (N-1)/N = 0.9, which is exactly the value you see on your diagonal. The normalization can be changed through the ddof parameter of pandas.DataFrame.std(), like this:

In [146]: from sklearn.decomposition import PCA

In [147]: df
Out[147]: 
         a1        a2        a3        a4        a5
0 -0.559104  0.185914 -2.331367  0.231150  0.357008
1  0.769835 -0.408685  0.375754  0.051397 -0.075885
2 -1.376530 -0.764808 -2.383611 -0.327153  1.746765
3 -0.830105 -0.197574  1.835807 -0.695089  0.881297
4 -0.991861  1.089319 -0.164139 -0.335003  0.795937
5 -1.132968 -2.240598 -0.101935  0.680038 -0.033921
6 -1.205631 -1.492009 -0.602400 -0.065256 -0.494267
7 -1.210978 -1.220986 -0.017062  0.024422 -0.224585
8 -0.332957  2.114870  0.818108  0.612831 -1.879758
9 -0.350612 -0.563872  0.869303 -0.325626 -0.372874

In [148]: df = (df-df.mean())/df.std(ddof=0)

In [149]: pca = PCA()

In [150]: pca.fit(df)
Out[150]: 
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [151]: pca.get_covariance()
Out[151]: 
array([[ 1.  ,  0.34,  0.33,  0.14, -0.36],
       [ 0.34,  1.  ,  0.16,  0.  , -0.27],
       [ 0.33,  0.16,  1.  , -0.15, -0.42],
       [ 0.14,  0.  , -0.15,  1.  , -0.63],
       [-0.36, -0.27, -0.42, -0.63,  1.  ]])
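As an alternative sketch (assuming the same myTable.csv as in the question), sklearn's StandardScaler divides by the population standard deviation, i.e. the ddof=0 version, so it produces the same scaling without calling df.std() by hand:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv('myTable.csv')         # same file as in the question
X = StandardScaler().fit_transform(df)  # (x - mean) / std, with ddof=0

pca = PCA().fit(X)
print(pca.get_covariance().round(2))    # ones on the diagonal, as above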
answered Dec 20 '25 by Tonechas


PCA and the correlation matrix are different things. The correlation matrix is just the product of the centred and normalised data with its transpose (there may be slightly differing definitions in the wild). PCA is a decomposition not dissimilar to the eigendecomposition. In particular, the PCs are, degeneracies aside, orthogonal, so there are no correlations there.

Of course, the two are related; for example, if all your vectors are correlated you'd expect a corresponding PC with a high weight.
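A small sketch of both claims, using synthetic stand-in data since this answer doesn't reuse the question's csv:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))              # stand-in for the question's data

# Correlation matrix as the product of the standardized data with its
# transpose (ddof=0 here, so the normalizations match).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = Z.T @ Z / len(Z)
print(np.allclose(corr, np.corrcoef(X, rowvar=False)))   # True

# The principal axes are orthonormal, hence mutually uncorrelated.
V = PCA().fit(Z).components_
print(np.allclose(V @ V.T, np.eye(5)))                   # True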

answered Dec 20 '25 by Paul Panzer