 

Most important original feature(s) of Principal Component Analysis

I am doing PCA and I am interested in which original features are most important. Let me illustrate this with an example:

import numpy as np
from sklearn.decomposition import PCA
X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1], [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])
print(X)

Which outputs:

[[ 1.  -1.  -1.  -1. ]
 [ 1.  -2.  -1.  -1. ]
 [ 1.  -3.  -2.  -1. ]
 [ 1.   1.   1.  -1. ]
 [ 1.   2.   1.  -1. ]
 [ 1.   3.   2.  -0.5]]

Intuitively, one could already say that feature 1 and feature 4 are not very important due to their low variance; a quick variance check makes this explicit.
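A minimal sanity check (assuming only the numpy import and the X defined above):

# Per-feature (column-wise) variance of X
print(np.var(X, axis=0))
# variances are roughly [0.0, 4.67, 2.0, 0.03]: features 1 and 4 barely vary

Now let's apply PCA to this set: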

pca = PCA(n_components=2)
pca.fit_transform(X)
comps = pca.components_

Output:

array([[ 0.        ,  0.8376103 ,  0.54436943,  0.04550712],
       [-0.        ,  0.54564656, -0.8297757 , -0.11722679]])

This output represents the importance of each original feature for each of the two principal components (see this for reference). In other words, for the first principal component, feature 2 is most important, then feature 3. For the second principal component, feature 3 looks most important.

The question is: which feature is most important, which one second most, etc.? Can I use the components_ attribute for this? Or am I wrong, and is PCA not the correct method for this kind of analysis (should I use a feature selection method instead)?

Guido asked Feb 23 '17 17:02




1 Answer

The components_ attribute alone is not the right spot to look for feature importance. The loadings in the two arrays (i.e. the two components PC1 and PC2) tell you how each original feature contributes to each principal component (taken together, they form a rotation matrix). But they don't tell you how much each component contributes to describing the transformed feature space, so you don't yet know how to compare the loadings across the two components.
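To make this concrete, here is a minimal sketch (reusing the data from the question, with default PCA settings, i.e. no whitening) showing that components_ simply projects the mean-centered data onto the principal axes:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1],
              [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X)

# Each row of components_ is a principal axis expressed in terms of the
# original features; projecting the mean-centered data onto these axes
# reproduces the transformed data.
X_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(X_proj, X_manual))  # True

So the loadings define the axes, but they say nothing about how much variance each axis carries.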

However, the answer that you linked actually tells you what to use instead: the explained_variance_ratio_ attribute. This attribute tells you how much of the variance in your feature space is explained by each principal component:

In [5]: pca.explained_variance_ratio_
Out[5]: array([ 0.98934303,  0.00757996])

This means that the first principal component explains almost 99 percent of the variance. You know from components_ that PC1 has the highest loading for the second feature. It follows, therefore, that feature 2 is the most important feature in your data space. Feature 3 is the next most important feature, as it has the second-highest loading in PC1.

In PC2, the absolute loadings are nearly swapped between feature 2 and feature 3. But as PC2 explains next to nothing of the overall variance, this can be neglected.
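If you want a single importance score per original feature, one common heuristic (just an illustration, not something prescribed by scikit-learn) is to weight the absolute loadings of each component by its explained variance ratio and sum them per feature. Continuing from the fitted pca above:

import numpy as np

# Weight the absolute loadings of each component by the fraction of the
# variance that component explains, then sum over components per feature.
importance = np.abs(pca.components_).T @ pca.explained_variance_ratio_
ranking = np.argsort(importance)[::-1]

print(importance)  # one score per original feature
print(ranking)     # feature indices, most important first: [1 2 3 0]

This reproduces the ranking described above: feature 2, then feature 3, then feature 4, with the constant feature 1 last.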

Schmuddi answered Oct 18 '22 09:10