Python Sklearn Covariance Matrix diagonal entries incorrect?

I am trying to perform PCA on some data. To my knowledge, a correlation matrix should have entries of 1 along the main diagonal, but that is not what .get_covariance() in sklearn's PCA returns. Why is this the case?
For my own purposes I could simply rescale the matrix to get diagonal entries of 1, but since I have already standardized my data, why are the diagonal entries not 1 in the first place?

In [1]: import pandas as pd

In [2]: import numpy as np                                                                                                                      

In [3]: from sklearn.decomposition import PCA                                                                                                   

In [4]: df = pd.read_csv('myTable.csv')                                                                                                         

In [5]: df                                                                                                                                      
Out[5]:                                                                                                                                         
         a1        a2        a3        a4        a5                                                                                             
0 -0.559104  0.185914 -2.331367  0.231150  0.357008                                                                                             
1  0.769835 -0.408685  0.375754  0.051397 -0.075885                                                                                             
2 -1.376530 -0.764808 -2.383611 -0.327153  1.746765                                                                                             
3 -0.830105 -0.197574  1.835807 -0.695089  0.881297                                                                                             
4 -0.991861  1.089319 -0.164139 -0.335003  0.795937                                                                                             
5 -1.132968 -2.240598 -0.101935  0.680038 -0.033921                                                                                             
6 -1.205631 -1.492009 -0.602400 -0.065256 -0.494267                                                                                             
7 -1.210978 -1.220986 -0.017062  0.024422 -0.224585                                                                                             
8 -0.332957  2.114870  0.818108  0.612831 -1.879758                                                                                             
9 -0.350612 -0.563872  0.869303 -0.325626 -0.372874                                                                                             

In [6]: df = (df-df.mean())/df.std()                                                                                                            

In [7]: pca = PCA()                                                                                                                             

In [8]: pca.fit(df)                                                                                                                             
Out[8]: PCA(copy=True, n_components=None, whiten=False)  

In [10]: pca.explained_variance_, pca.components_, pca.get_covariance()                                                                         
Out[10]:                                                                                                                                        
(array([ 1.8780651 ,  1.1526052 ,  0.78052872,  0.55167761,  0.13712337]),                                                                      
 array([[-0.47790108, -0.36036503, -0.38619941, -0.35716396,  0.60417838],                                                                      
        [ 0.25426743,  0.32305024,  0.47784502, -0.72831952,  0.26870322],                                                                      
        [-0.17613902, -0.7303121 ,  0.6250759 , -0.05118019, -0.20562097],                                                                      
        [ 0.82132736, -0.45982165, -0.21938834,  0.03274499,  0.25452296],                                                                      
        [ 0.03681087, -0.14485808, -0.42855924, -0.58162955, -0.67505936]]),                                                                    
 array([[ 0.9       ,  0.30943895,  0.29916112,  0.12605405, -0.32333097],                                                                      
        [ 0.30943895,  0.9       ,  0.14715469,  0.00295615, -0.24279645],                                                                      
        [ 0.29916112,  0.14715469,  0.9       , -0.13683409, -0.38167791],                                                                      
        [ 0.12605405,  0.00295615, -0.13683409,  0.9       , -0.56418468],                                                                      
        [-0.32333097, -0.24279645, -0.38167791, -0.56418468,  0.9       ]]))   
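For comparison, the correlation matrix computed directly does have ones on its diagonal regardless of which ddof was used in the scaling above; a minimal sanity check on the same df:

# df.corr() normalizes internally, so its diagonal is always 1.0
# no matter how the columns were scaled beforehand.
print(df.corr())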

Closed
The problem was with my standardization: I should have used df.std(ddof=0), as suggested by Tonechas.

asked Dec 18 '25 by AsheKetchum

2 Answers

You need to normalize the standard deviation by N rather than by N-1 (pandas' default). With N = 10 samples, standardizing by the N-1 estimate scales every entry of the covariance matrix by (N-1)/N = 0.9, which is exactly the value you see on your diagonal. The normalization can be changed through the ddof parameter of pandas.DataFrame.std(), like this:

In [146]: from sklearn.decomposition import PCA

In [147]: df
Out[147]: 
         a1        a2        a3        a4        a5
0 -0.559104  0.185914 -2.331367  0.231150  0.357008
1  0.769835 -0.408685  0.375754  0.051397 -0.075885
2 -1.376530 -0.764808 -2.383611 -0.327153  1.746765
3 -0.830105 -0.197574  1.835807 -0.695089  0.881297
4 -0.991861  1.089319 -0.164139 -0.335003  0.795937
5 -1.132968 -2.240598 -0.101935  0.680038 -0.033921
6 -1.205631 -1.492009 -0.602400 -0.065256 -0.494267
7 -1.210978 -1.220986 -0.017062  0.024422 -0.224585
8 -0.332957  2.114870  0.818108  0.612831 -1.879758
9 -0.350612 -0.563872  0.869303 -0.325626 -0.372874

In [148]: df = (df-df.mean())/df.std(ddof=0)

In [149]: pca = PCA()

In [150]: pca.fit(df)
Out[150]: 
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [151]: pca.get_covariance()
Out[151]: 
array([[ 1.  ,  0.34,  0.33,  0.14, -0.36],
       [ 0.34,  1.  ,  0.16,  0.  , -0.27],
       [ 0.33,  0.16,  1.  , -0.15, -0.42],
       [ 0.14,  0.  , -0.15,  1.  , -0.63],
       [-0.36, -0.27, -0.42, -0.63,  1.  ]])
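As an alternative sketch (assuming the same myTable.csv as in the question), sklearn's StandardScaler divides by the population standard deviation, i.e. the ddof=0 version, so it produces the same scaling without calling df.std() by hand:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv('myTable.csv')         # same file as in the question
X = StandardScaler().fit_transform(df)  # (x - mean) / std, with ddof=0

pca = PCA().fit(X)
print(pca.get_covariance().round(2))    # ones on the diagonal, as above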
answered Dec 20 '25 by Tonechas


PCA and the correlation matrix are different things. The correlation matrix is just the product of the centred and normalised data with its transpose (there may be slightly differing definitions in the wild). PCA is a decomposition not dissimilar to the eigendecomposition. In particular, the PCs are, degeneracies aside, orthogonal, so there are no correlations there.

Of course, the two are related; for example, if all your vectors are correlated you'd expect a corresponding PC with a high weight.
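A small sketch of both claims, using synthetic stand-in data since this answer doesn't reuse the question's csv:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))              # stand-in for the question's data

# Correlation matrix as the product of the standardized data with its
# transpose (ddof=0 here, so the normalizations match).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = Z.T @ Z / len(Z)
print(np.allclose(corr, np.corrcoef(X, rowvar=False)))   # True

# The principal axes are orthonormal, hence mutually uncorrelated.
V = PCA().fit(Z).components_
print(np.allclose(V @ V.T, np.eye(5)))                   # True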

answered Dec 20 '25 by Paul Panzer