Scikit-Learn PCA

I am using input data from here (see Section 3.1).

I am trying to reproduce their covariance matrix, eigenvalues, and eigenvectors using scikit-learn. However, I am unable to reproduce the results as presented in the data source. I've also seen this input data elsewhere but I can't discern whether it's a problem with scikit-learn, my steps, or the data source.

import numpy as np
from sklearn.decomposition import PCA

data = np.array([[2.5,2.4],
                 [0.5,0.7],
                 [2.2,2.9],
                 [1.9,2.2],
                 [3.1,3.0],
                 [2.3,2.7],
                 [2.0,1.6],
                 [1.0,1.1],
                 [1.5,1.6],
                 [1.1,0.9],
                 ])

centered_data = data-data.mean(axis=0)
pca = PCA()
pca.fit(centered_data)
print(pca.get_covariance()) #Covariance Matrix

array([[ 0.5549,  0.5539],
       [ 0.5539,  0.6449]])

print(pca.explained_variance_ratio_) #Eigenvalues (normalized)

[ 0.96318131  0.03681869]

print(pca.components_) #Eigenvectors

[[-0.6778734  -0.73517866]
 [ 0.73517866 -0.6778734 ]]

Surprisingly, the projections match the results from the data source described above.

print(pca.transform(centered_data)) #Projections

array([[-0.82797019,  0.17511531],
       [ 1.77758033, -0.14285723],
       [-0.99219749, -0.38437499],
       [-0.27421042, -0.13041721],
       [-1.67580142,  0.20949846],
       [-0.9129491 , -0.17528244],
       [ 0.09910944,  0.3498247 ],
       [ 1.14457216, -0.04641726],
       [ 0.43804614, -0.01776463],
       [ 1.22382056,  0.16267529]])

Here is what I don't understand:

  1. Why is the covariance matrix different?
  2. Updated: How do I get eigenvalues from scikit-learn that are not already normalized?
asked Dec 30 '14 by slaw

1 Answer

Correct covariance matrix of this data:

numpy.cov(data.transpose())
array([[ 0.61655556,  0.61544444],
       [ 0.61544444,  0.71655556]])

Biased (i.e. "incorrect", using the wrong normalization term, and underestimating variance in the data set) covariance matrix:

numpy.cov(data.transpose(), bias=1)
array([[ 0.5549,  0.5539],
       [ 0.5539,  0.6449]])

numpy.cov knows that it has to center your data, so you don't need centered_data here.
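
A quick way to see this (a minimal check, assuming the data and centered_data arrays from the question are in scope):

import numpy

# numpy.cov subtracts the column means internally, so pre-centering changes nothing
numpy.allclose(numpy.cov(data.transpose()),
               numpy.cov(centered_data.transpose()))
True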

The values scikit-learn's PCA reports do not correspond 1:1 to the eigenvalues.

Correct eigenvalue decomposition:

numpy.linalg.eig(numpy.cov(data.transpose()))
(array([ 0.0490834 ,  1.28402771]),
 array([[-0.73517866, -0.6778734 ],
        [ 0.6778734 , -0.73517866]]))

Using the biased estimator yields different Eigenvalues (again, underestimating variance), but the same Eigenvectors:

(array([ 0.04417506,  1.15562494]), ...

Note that the eigenvectors are not yet sorted by decreasing eigenvalue.
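
If you want them sorted, a short sketch using numpy.argsort to reorder the eigenpairs by decreasing eigenvalue (the same idea is used in the three-line snippet further down):

s, e = numpy.linalg.eig(numpy.cov(data.transpose()))
o = numpy.argsort(s)[::-1]   # indices of the eigenvalues, largest first
s[o]                         # array([ 1.28402771,  0.0490834 ])
e[:, o]                      # eigenvectors as columns, in the same order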

As the name pca.explained_variance_ratio_ indicates, these are not the eigenvalues; they are ratios. If we take the (biased, underestimating) eigenvalues and normalize them to have a sum of 1, we get

s, _ = numpy.linalg.eig(numpy.cov(data.transpose(), bias=1))
s / sum(s)
array([ 0.03681869,  0.96318131])
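
Regarding the updated question 2: scikit-learn exposes the unnormalized values as pca.explained_variance_. Note this is an assumption about version behaviour — older releases computed it with the biased 1/n term (matching get_covariance() above), while newer ones use the unbiased 1/(n-1) — so expect one of the two sets of eigenvalues shown in this answer:

pca.explained_variance_   # eigenvalues, not normalized to sum to 1
# either array([ 1.15562494,  0.04417506])  (biased, 1/n)
# or     array([ 1.28402771,  0.0490834 ])  (unbiased, 1/(n-1)), depending on the version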

Also, scikit-learn's pca.transform apparently does not apply scaling. IMHO, when using PCA, it is also fairly common to scale each component to unit variance, which obviously does not hold for this output. In that case the result would be (with the two columns swapped; I did not bother to change this)

s, e = numpy.linalg.eig(numpy.cov(data.transpose()))
o = numpy.argsort(s)[::-1]   # columns sorted by decreasing eigenvalue
(data - data.mean(axis=0)).dot(e[:, o]) / numpy.sqrt(s[o])
array([[-0.73068047, -0.79041795],
       [ 1.56870773,  0.64481466],
       [-0.87561043,  1.73495337],
       [-0.24198963,  0.58866414],
       [-1.47888824, -0.94561319],
       [-0.80567404,  0.79117236],
       [ 0.08746369, -1.57900372],
       [ 1.01008049,  0.20951358],
       [ 0.38657401,  0.08018421],
       [ 1.08001688, -0.73426743]])

(As you can see, PCA is just three lines in numpy, so you don't need a function for this.)
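
For completeness, scikit-learn can apply this unit-variance scaling itself via whiten=True; a sketch (assuming the same data as in the question), which should reproduce the scaled projections above up to the sign of each column and the 1/n vs 1/(n-1) detail discussed below:

from sklearn.decomposition import PCA

pca_w = PCA(whiten=True)
print(pca_w.fit_transform(data))   # components scaled to unit variance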

Why do I think this is the proper result? Because the resulting data set has the property that its covariance matrix is (except for rounding errors) the identity matrix. Without scaling, the covariance matrix is numpy.diag(s[o]). But one may also argue that by applying the scaling, I "lost" the variance information that would otherwise have been kept.
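
This can be checked directly (a small verification sketch, reusing s, e, and o from the three-line snippet above):

projected = (data - data.mean(axis=0)).dot(e[:, o])

numpy.cov(projected.transpose())                        # ~ numpy.diag(s[o]), without scaling
numpy.cov((projected / numpy.sqrt(s[o])).transpose())   # ~ identity matrix, with scaling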

In my opinion, scikit-learn is using the wrong (biased) covariance; numpy is correct.

But more often than not, it does not matter much. In the ratio above, the bias cancels out. And if you have a large data set, the difference between the naive 1/n and the unbiased 1/(n-1) eventually becomes negligible. Since the unbiased variance estimate comes at effectively zero extra CPU cost, you might as well use it.
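
A quick illustration of that last point (a sketch with synthetic data, not related to the question's data set): the two estimators differ only by the factor (n-1)/n.

n = 10000
x = numpy.random.randn(n)
numpy.var(x)              # biased estimate, 1/n
numpy.var(x, ddof=1)      # unbiased estimate, 1/(n-1); larger by a factor n/(n-1) ~ 1.0001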

answered Sep 18 '22 by Has QUIT--Anony-Mousse