Scikit-Learn PCA

I am using input data from here (see Section 3.1).

I am trying to reproduce their covariance matrix, eigenvalues, and eigenvectors using scikit-learn. However, I am unable to reproduce the results as presented in the data source. I've also seen this input data elsewhere but I can't discern whether it's a problem with scikit-learn, my steps, or the data source.

import numpy as np
from sklearn.decomposition import PCA

data = np.array([[2.5,2.4],
                 [0.5,0.7],
                 [2.2,2.9],
                 [1.9,2.2],
                 [3.1,3.0],
                 [2.3,2.7],
                 [2.0,1.6],
                 [1.0,1.1],
                 [1.5,1.6],
                 [1.1,0.9],
                 ])

centered_data = data-data.mean(axis=0)
pca = PCA()
pca.fit(centered_data)
print(pca.get_covariance()) #Covariance Matrix

array([[ 0.5549,  0.5539],
       [ 0.5539,  0.6449]])

print(pca.explained_variance_ratio_) #Eigenvalues (normalized)

[ 0.96318131  0.03681869]

print(pca.components_) #Eigenvectors

[[-0.6778734  -0.73517866]
 [ 0.73517866 -0.6778734 ]]

Surprisingly, the projections match the results from the data source described above.

print(pca.transform(centered_data)) #Projections

array([[-0.82797019,  0.17511531],
       [ 1.77758033, -0.14285723],
       [-0.99219749, -0.38437499],
       [-0.27421042, -0.13041721],
       [-1.67580142,  0.20949846],
       [-0.9129491 , -0.17528244],
       [ 0.09910944,  0.3498247 ],
       [ 1.14457216, -0.04641726],
       [ 0.43804614, -0.01776463],
       [ 1.22382056,  0.16267529]])

Here is what I don't understand:

  1. Why is the covariance matrix different?
  2. Updated: How do I get eigenvalues from scikit-learn that are not already normalized?
asked Dec 30 '14 by slaw

1 Answer

Correct covariance matrix of this data:

numpy.cov(data.transpose())
array([[ 0.61655556,  0.61544444],
       [ 0.61544444,  0.71655556]])

Biased (i.e. "incorrect", using the wrong normalization term, and underestimating variance in the data set) covariance matrix:

numpy.cov(data.transpose(), bias=1)
array([[ 0.5549,  0.5539],
       [ 0.5539,  0.6449]])

numpy.cov knows that it has to center your data, so you don't need centered_data here.
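
A quick way to see this (a minimal check, assuming the data and centered_data arrays from the question are in scope):

import numpy

# numpy.cov subtracts the column means internally, so pre-centering changes nothing
numpy.allclose(numpy.cov(data.transpose()),
               numpy.cov(centered_data.transpose()))
True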

The values scikit-learn's PCA reports do not correspond 1:1 to the eigenvalues.

Correct eigenvalue decomposition:

numpy.linalg.eig(numpy.cov(data.transpose()))
(array([ 0.0490834 ,  1.28402771]),
 array([[-0.73517866, -0.6778734 ],
        [ 0.6778734 , -0.73517866]]))

Using the biased estimator yields different Eigenvalues (again, underestimating variance), but the same Eigenvectors:

(array([ 0.04417506,  1.15562494]), ...

Note that the eigenvectors are not yet sorted by decreasing eigenvalue.
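
If you want them sorted, a short sketch using numpy.argsort to reorder the eigenpairs by decreasing eigenvalue (the same idea is used in the three-line snippet further down):

s, e = numpy.linalg.eig(numpy.cov(data.transpose()))
o = numpy.argsort(s)[::-1]   # indices of the eigenvalues, largest first
s[o]                         # array([ 1.28402771,  0.0490834 ])
e[:, o]                      # eigenvectors as columns, in the same order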

As the name pca.explained_variance_ratio_ indicates, these are not the eigenvalues; they are ratios. If we take the (biased, underestimating) eigenvalues and normalize them to have a sum of 1, we get

s, _ = numpy.linalg.eig(numpy.cov(data.transpose(), bias=1))
s / sum(s)
array([ 0.03681869,  0.96318131])
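
Regarding the updated question 2: scikit-learn exposes the unnormalized values as pca.explained_variance_. Note this is an assumption about version behaviour — older releases computed it with the biased 1/n term (matching get_covariance() above), while newer ones use the unbiased 1/(n-1) — so expect one of the two sets of eigenvalues shown in this answer:

pca.explained_variance_   # eigenvalues, not normalized to sum to 1
# either array([ 1.15562494,  0.04417506])  (biased, 1/n)
# or     array([ 1.28402771,  0.0490834 ])  (unbiased, 1/(n-1)), depending on the version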

Also, scikit-learn's pca.transform apparently does not apply scaling. IMHO, when using PCA, it is also fairly common to scale each component to unit variance, which obviously does not hold for this output. In that case the result would be (with the two columns swapped; I did not bother to change this)

s, e = numpy.linalg.eig(numpy.cov(data.transpose()))
o = numpy.argsort(s)[::-1]   # columns sorted by decreasing eigenvalue
(data - data.mean(axis=0)).dot(e[:, o]) / numpy.sqrt(s[o])
array([[-0.73068047, -0.79041795],
       [ 1.56870773,  0.64481466],
       [-0.87561043,  1.73495337],
       [-0.24198963,  0.58866414],
       [-1.47888824, -0.94561319],
       [-0.80567404,  0.79117236],
       [ 0.08746369, -1.57900372],
       [ 1.01008049,  0.20951358],
       [ 0.38657401,  0.08018421],
       [ 1.08001688, -0.73426743]])

(As you can see, PCA is just three lines in numpy, so you don't need a function for this.)
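
For completeness, scikit-learn can apply this unit-variance scaling itself via whiten=True; a sketch (assuming the same data as in the question), which should reproduce the scaled projections above up to the sign of each column and the 1/n vs 1/(n-1) detail discussed below:

from sklearn.decomposition import PCA

pca_w = PCA(whiten=True)
print(pca_w.fit_transform(data))   # components scaled to unit variance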

Why do I think this is the proper result? Because the resulting data set has the property that its covariance matrix is (except for rounding errors) the identity matrix. Without scaling, the covariance matrix is numpy.diag(s[o]). But one may also argue that by applying the scaling, I "lost" the variance information that would otherwise have been kept.
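
This can be checked directly (a small verification sketch, reusing s, e, and o from the three-line snippet above):

projected = (data - data.mean(axis=0)).dot(e[:, o])

numpy.cov(projected.transpose())                        # ~ numpy.diag(s[o]), without scaling
numpy.cov((projected / numpy.sqrt(s[o])).transpose())   # ~ identity matrix, with scaling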

In my opinion, scikit-learn is using the wrong (biased) covariance; numpy is correct.

But more often than not, it does not matter much. In the ratio above, the bias cancels out. And if you have a large data set, the difference between the naive 1/n and the unbiased 1/(n-1) eventually becomes negligible. Since the unbiased variance estimate comes at effectively zero extra CPU cost, you might as well use it.
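
A quick illustration of that last point (a sketch with synthetic data, not related to the question's data set): the two estimators differ only by the factor (n-1)/n.

n = 10000
x = numpy.random.randn(n)
numpy.var(x)              # biased estimate, 1/n
numpy.var(x, ddof=1)      # unbiased estimate, 1/(n-1); larger by a factor n/(n-1) ~ 1.0001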

answered Sep 18 '22 by Has QUIT--Anony-Mousse