 

Python scikit learn pca.explained_variance_ratio_ cutoff

When choosing the number of principal components (k), we choose k to be the smallest value such that, for example, 99% of the variance is retained.

However, in Python's scikit-learn, I am not 100% sure that pca.explained_variance_ratio_ = 0.99 means "99% of the variance is retained". Could anyone clarify? Thanks.

  • The scikit-learn PCA documentation is here: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

asked Sep 30 '15 by Chubaka




2 Answers

Yes, you are nearly right. The pca.explained_variance_ratio_ attribute returns a vector of the fraction of variance explained by each dimension. Thus pca.explained_variance_ratio_[i] gives the variance explained solely by the (i+1)-th dimension.

You probably want to do pca.explained_variance_ratio_.cumsum(). That will return a vector x such that x[i] returns the cumulative variance explained by the first i+1 dimensions.

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
my_matrix = np.random.randn(20, 5)

my_model = PCA(n_components=5)
my_model.fit_transform(my_matrix)

print(my_model.explained_variance_)
print(my_model.explained_variance_ratio_)
print(my_model.explained_variance_ratio_.cumsum())

[ 1.50756565  1.29374452  0.97042041  0.61712667  0.31529082]
[ 0.32047581  0.27502207  0.20629036  0.13118776  0.067024  ]
[ 0.32047581  0.59549787  0.80178824  0.932976    1.        ]

So in my random toy data, if I picked k=4 I would retain 93.3% of the variance.
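
To automate that cutoff, here is a minimal sketch (reusing my_model and the 0.99 target from the question; since all components are kept here, the ratios sum to 1, so the threshold is always reached):

import numpy as np

# Cumulative fraction of variance explained by the first i+1 components
cumulative = my_model.explained_variance_ratio_.cumsum()

# Smallest k such that at least 99% of the variance is retained;
# np.argmax returns the first index where the condition holds
k = np.argmax(cumulative >= 0.99) + 1
print(k)  # 5 for the toy data above, since only all five components reach 99%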

answered Sep 21 '22 by Curt F.


Although this question is more than two years old, I want to provide an update. I wanted to do the same thing, and it looks like sklearn now provides this feature out of the box.

As stated in the docs

if 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components

So the code required is now

my_model = PCA(n_components=0.99, svd_solver='full')
my_model.fit_transform(my_matrix)
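
If you want to verify how many components were actually selected, the fitted model exposes this as the n_components_ attribute (shown here as a quick check):

print(my_model.n_components_)  # number of components needed to reach the 99% threshold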
answered Sep 20 '22 by Yannic Klem