
Finding and utilizing eigenvalues and eigenvectors from PCA in scikit-learn

I have been using the PCA implementation in scikit-learn. However, I want to find the eigenvalues and eigenvectors that result after we fit the training dataset. There is no mention of either in the docs.

Secondly, can these eigenvalues and eigenvectors themselves be utilized as features for classification purposes?

asked Jun 10 '16 by Amol Agrawal

1 Answer

I am assuming here that by eigenvectors you mean the eigenvectors of the covariance matrix.

Let's say you have n data points in a p-dimensional space, and X is a p x n matrix of your points. Then the directions of the principal components are the eigenvectors of the covariance matrix XXᵀ (assuming X has been centered). You can obtain the directions of these eigenvectors from sklearn by accessing the components_ attribute of the PCA object. This can be done as follows:

from sklearn.decomposition import PCA
import numpy as np

# note: sklearn expects X with shape (n_samples, n_features)
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA()
pca.fit(X)
print(pca.components_)

This gives an output like

[[ 0.83849224  0.54491354]
[ 0.54491354 -0.83849224]]

where every row is a principal component in the p-dimensional space (2 in this toy example). Each of these rows is an eigenvector of the centered covariance matrix XXᵀ.
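
Since these rows are eigenvectors of a symmetric matrix, they form an orthonormal set. A quick sanity check on the pca object fitted above:

# rows of components_ are orthonormal directions in the feature space
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))  # True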

As far as the eigenvalues go, there is no dedicated attribute for them on older versions of the PCA object. The PCA object does have an attribute called explained_variance_ratio_, which gives the fraction of the total variance accounted for by each component. These numbers are proportional to the eigenvalues. In the case of our toy example, we get the following if we print the explained_variance_ratio_ attribute:

[ 0.99244289  0.00755711]

This means that the ratio of the eigenvalue of the first principal component to the eigenvalue of the second principal component is 0.99244289:0.00755711.
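
Newer versions of scikit-learn also expose the eigenvalues directly through the explained_variance_ attribute. A minimal sketch, reusing the pca object fitted above:

# eigenvalues of the sample covariance matrix (scaled by 1/(n - 1))
print(pca.explained_variance_)
# the same values, normalized to sum to 1
print(pca.explained_variance_ratio_)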

If the basic mathematics of PCA is clear to you, then a better way to get the eigenvectors and eigenvalues is to use numpy.linalg.eig on the centered covariance matrix. If your data matrix is a p x n matrix X (p features, n points), then you can use the following code:

import numpy as np

# center each feature (row) by subtracting its mean across the n points
centered_matrix = X - X.mean(axis=1)[:, np.newaxis]
# unnormalized covariance matrix; dividing by (n - 1) would give the sample
# covariance, which rescales the eigenvalues but not the eigenvectors
cov = np.dot(centered_matrix, centered_matrix.T)
eigvals, eigvecs = np.linalg.eig(cov)
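
Note that np.linalg.eig does not sort its output, and the toy X above is stored as n x p, so it has to be transposed first to fit this snippet. A sketch of how the two routes line up, using np.linalg.eigh (the appropriate routine for symmetric matrices) and the pca object fitted earlier:

Xt = X.T  # p x n, as the snippet above assumes
centered = Xt - Xt.mean(axis=1)[:, np.newaxis]
cov = centered @ centered.T / (X.shape[0] - 1)  # sample covariance with 1/(n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # flip to descending, matching sklearn
print(np.allclose(eigvals, pca.explained_variance_))  # True
# columns of eigvecs match rows of pca.components_ up to sign
print(np.allclose(np.abs(eigvecs.T), np.abs(pca.components_)))  # True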

Coming to your second question: these eigenvalues and eigenvectors cannot be used by themselves for classification. For classification you need features for each data point, whereas the eigenvectors and eigenvalues you generate here are derived from the entire covariance matrix XXᵀ. For dimensionality reduction you could instead use the projections of your original points (in the p-dimensional space) onto the principal components obtained from PCA, as sketched below. However, this is not always useful either, because PCA does not take the labels of your training data into account. I would recommend looking into LDA (Linear Discriminant Analysis) for supervised problems.
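
The projection mentioned above is exactly what PCA.transform computes. A minimal sketch using the pca object fitted earlier:

X_proj = pca.transform(X)  # shape (n_samples, n_components)
# equivalent to centering and multiplying by the component directions
X_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(X_proj, X_manual))  # True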

Hope that helps.

answered Oct 31 '22 by Abhinav Arora