I am running PCA on my data (~250 features) and see that all points are clustered in 3 blobs.
Is it possible to see which of the 250 features contribute most to the outcome? If so, how?
(using the Scikit-learn implementation)
The importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (the higher the magnitude, the higher the importance). In this example, we can conclude that features 1, 3 and 4 are the most important for PC1. Similarly, we can state that features 2 and then 1 are the most important for PC2.
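In scikit-learn those magnitudes are stored in pca.components_. Here is a minimal sketch of reading them off; the data matrix X below is made up purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

# Made-up stand-in for the real ~250-feature dataset.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 250))

pca = PCA(n_components=2).fit(X)

# pca.components_ has shape (n_components, n_features); each row is an
# eigenvector whose entries are the loadings of the original features.
for i, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:5]
    print(f"PC{i + 1}: features with the largest absolute loadings: {top}")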
The contribution is a scaled version of the squared correlation between variables and component axes (or, geometrically, the squared cosine); it is used to assess the quality of the representation of the variables on the principal component, and it is computed as cos²(variable, axis) × 100 / total cos² of the component.
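A rough way to get those contributions from a fitted scikit-learn PCA is to rescale the squared loadings to percentages per component; this is only a sketch under the assumption of standardized data, and the variable names are my own:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up, standardized data so that loadings relate directly to correlations.
rng = np.random.RandomState(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 250)))

pca = PCA(n_components=2).fit(X)

# Squared loadings rescaled to percentages per component; since each row of
# pca.components_ is a unit-norm eigenvector, the eigenvalue factors in the
# cos² definition cancel and each row of contrib sums to 100.
loadings_sq = pca.components_ ** 2
contrib = 100 * loadings_sq / loadings_sq.sum(axis=1, keepdims=True)
print(contrib.sum(axis=1))  # [100. 100.]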
Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we'd need about 20 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present in multiple observations.
If our sole intention in doing PCA is data visualization, the best number of components is 2 or 3. If we really want to reduce the size of the dataset, the best number of principal components is much smaller than the number of variables in the original dataset.
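As a sketch of that explained-variance check in scikit-learn (X again stands in for your own data):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder data; replace X with the real ~250-feature matrix.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 250))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components retaining at least 90% of the variance.
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_90)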
Let's see what Wikipedia says:
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
To see how 'influential' the vectors of the original space are in the smaller one, you have to project them as well, which is done by:

res = pca.transform(np.eye(D))

np.eye(n) creates an n x n identity matrix (ones on the diagonal, zeros elsewhere), so np.eye(D) represents your features in the original feature space, and res is the projection of those features into the lower-dimensional space. The interesting thing is that res is a D x d matrix where res[i][j] represents how much feature i contributes to component j.

Then you can sum each row over its d columns to get a D x 1 vector (call it contribution), where each contribution[i] is the total contribution of feature i. Sort it and you find the most contributing features :)
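Putting that together, a runnable sketch of the idea (X, D and d are placeholders for your own data and dimensions):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: D original features, projected down to d components.
rng = np.random.RandomState(0)
D, d = 250, 2
X = rng.normal(size=(500, D))

pca = PCA(n_components=d).fit(X)

# Project the D original basis vectors into the d-dimensional PCA space.
res = pca.transform(np.eye(D))            # shape (D, d)

# Total contribution of each feature, then sort to see the biggest first.
contribution = res.sum(axis=1)            # shape (D,)
ranking = np.argsort(contribution)[::-1]
print(ranking[:10])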
Not sure whether that's clear; I could add more information if needed.
Hope this helps, pltrdy