How to use scikit-learn PCA for features reduction and know which features are discarded

I am trying to run a PCA on a matrix of dimensions m x n where m is the number of features and n the number of samples.

Suppose I want to preserve the nf features with the maximum variance. With scikit-learn I am able to do it in this way:

from sklearn.decomposition import PCA

nf = 100
pca = PCA(n_components=nf)
# X is the matrix transposed (n samples on the rows, m features on the columns)
pca.fit(X)

X_new = pca.transform(X)

Now I get a new matrix X_new with shape n x nf. Is it possible to know which features have been discarded and which have been retained?
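(Side note, not part of the original question: after fitting, pca.explained_variance_ratio_ reports how much of the total variance each retained component captures, which is a quick sanity check for the choice of nf.)

print(pca.explained_variance_ratio_)        # variance fraction per retained component
print(pca.explained_variance_ratio_.sum())  # total fraction of variance kept by the nf components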

Thanks

asked Apr 25 '14 by gc5



2 Answers

The directions that your PCA object has determined during fitting are in pca.components_. The vector space orthogonal to the one spanned by pca.components_ is discarded.

Please note that PCA does not "discard" or "retain" any of your pre-defined features (encoded by the columns you specify). It mixes all of them (by weighted sums) to find orthogonal directions of maximum variance.

If this is not the behaviour you are looking for, then PCA dimensionality reduction is not the way to go. For some simple, general feature selection methods, take a look at sklearn.feature_selection.
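If you nevertheless want a per-feature notion of importance, one common heuristic (a sketch of one possible approach, not something stated in the answer above) is to inspect the absolute weights in pca.components_, whose rows are the loading vectors over the original columns:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: n samples (rows) x m features (columns)
X = np.random.RandomState(0).randn(200, 10)

pca = PCA(n_components=3).fit(X)

# pca.components_ has shape (n_components, m); each row holds the weights
# ("loadings") of the original features in one principal direction.
loadings = np.abs(pca.components_)

# Heuristic: rank the original features by their largest absolute loading
# across the retained components.
importance = loadings.max(axis=0)
ranked = np.argsort(importance)[::-1]
print(ranked)  # indices of original columns, most influential first

This only ranks features by how strongly they feed into the retained directions; it does not turn PCA into a feature-selection method.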

answered by eickenberg


Projecting the features onto the principal components retains the important information (the axes with maximum variance) and drops the axes with small variance. This behaviour is more like compression than discarding.

X_proj would be a better name than X_new, because it is the projection of X onto the principal components.

You can compute the reconstruction X_rec as

X_rec = pca.inverse_transform(X_proj) # X_proj is originally X_new 

Here, X_rec is close to X, but the less important information has been dropped by PCA, so we can say X_rec is denoised.

In my opinion, what PCA discards is the noise, not any particular feature.
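A small sketch of that idea (assuming X is an n-samples-by-m-features array, as in the question): compare X with its reconstruction to see how much information was dropped.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).randn(200, 10)  # hypothetical data

pca = PCA(n_components=3).fit(X)
X_proj = pca.transform(X)              # projection onto the principal components
X_rec = pca.inverse_transform(X_proj)  # back-projection into the original feature space

# Mean squared reconstruction error: reflects the variance PCA dropped.
print(np.mean((X - X_rec) ** 2))
# Fraction of the total variance that was not kept by the 3 components.
print(1.0 - pca.explained_variance_ratio_.sum())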

answered by emeth