I am doing Principal Component Analysis (PCA) and I'd like to find out which features contribute the most to the result.
My intuition is to sum up the absolute values of the individual contributions of the features to the individual components.
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1, 4, 1], [-2, -1, 4, 2], [-3, -2, 4, 3], [1, 1, 4, 4], [2, 1, 4, 5], [3, 2, 4, 6]])
pca = PCA(n_components=0.95, whiten=True, svd_solver='full').fit(X)
pca.components_
array([[ 0.71417303, 0.46711713, 0. , 0.52130459],
[-0.46602418, -0.23839061, -0. , 0.85205128]])
np.sum(np.abs(pca.components_), axis=0)
array([1.18019721, 0.70550774, 0. , 1.37335586])
This yields, in my eyes, a measure of importance of each of the original features. Note that the 3rd feature has zero importance, because I intentionally created a column that is just a constant value.
Is there a better "measure of importance" for PCA?
Principal Component Analysis (PCA) is a fantastic technique for dimensionality reduction, and it can also be used to gauge feature importance.
(This is a different notion of importance than permutation importance, where a feature is "important" if shuffling its values increases a model's prediction error, because in that case the model relied on the feature for its predictions. Here we stay within PCA itself and ask how much each original feature contributes to the components.)
PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
The rule of thumb for preprocessing is: if your data is already on a comparable scale (e.g. every feature is XX per 100 inhabitants), scaling it will remove the information contained in the fact that your features have unequal variances. If the data is on different scales, then you should standardize it before running PCA.
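For instance, a minimal sketch of standardizing before PCA with sklearn's StandardScaler, reusing X from the question (whether this is appropriate depends on your data, per the rule of thumb above):
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Standardize each feature to zero mean and unit variance, then fit PCA.
# StandardScaler leaves the constant third column at 0 rather than dividing by its zero variance.
X_std = StandardScaler().fit_transform(X)
pca_std = PCA(n_components=0.95, whiten=True, svd_solver='full').fit(X_std)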
The measure of importance for PCA is in explained_variance_ratio_. This array gives the fraction of variance explained by each component. The components are sorted by explained variance in descending order, and the ratios sum to 1 when all components are used, or otherwise to the minimal possible value above the requested threshold. In your example you set the threshold to 95% (of the variance that should be explained), so the array sums to 0.9949522861608583: the first component explains 92.021143% and the second 7.474085% of the variance, hence the 2 components you receive.
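You can inspect these values directly for the question's example (the numbers below are the ones quoted above):
pca.explained_variance_ratio_
array([0.92021143, 0.07474085])
pca.explained_variance_ratio_.sum()
0.9949522861608583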
components_ is the array that stores the directions of maximum variance in the feature space. Its dimensions are n_components_ by n_features_. This is what the data points are multiplied by when applying transform() to get the reduced-dimensionality projection of the data.
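Roughly, for the question's example (with whiten=True, so each projected component is additionally rescaled to unit variance), transform() amounts to the following sketch:
# Center the data, project it onto the PC directions, then whiten.
Z_manual = (X - pca.mean_) @ pca.components_.T
Z_manual /= np.sqrt(pca.explained_variance_)   # whitening step, only because whiten=True
np.allclose(Z_manual, pca.transform(X))        # True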
In order to get the percentage contribution of the original features to each of the principal components, you just need to normalize components_, as its entries set the amount each original feature contributes to each projection direction.
r = np.abs(pca.components_.T)   # absolute loadings, shape (n_features_, n_components_)
r/r.sum(axis=0)                 # normalize so each column (PC) sums to 1
array([[0.41946155, 0.29941172],
[0.27435603, 0.15316146],
[0. , 0. ],
[0.30618242, 0.54742682]])
As you can see, the third feature does not contribute to any of the PCs.
If you need the total contribution of the original features to the explained variance, you need to take each PC's contribution (i.e. explained_variance_ratio_) into account:
ev = np.abs(pca.components_.T).dot(pca.explained_variance_ratio_)   # weight the loadings by each PC's share of variance
ttl_ev = pca.explained_variance_ratio_.sum()*ev/ev.sum()            # rescale so the total equals the explained variance
print(ttl_ev)
[0.40908847 0.26463667 0. 0.32122715]
If you just purely sum the PCs with np.sum(np.abs(pca.components_), axis=0), that assumes all PCs are equally important, which is rarely true. To use PCA for crude feature selection, sum only after discarding low-contribution PCs and/or after scaling the PCs by their relative contributions (as in the ttl_ev computation above).
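As a minimal sketch of the discarding variant (the 5% cut-off below is an arbitrary choice for illustration, not a recommendation):
min_ratio = 0.05                                    # arbitrary cut-off chosen for illustration
keep = pca.explained_variance_ratio_ >= min_ratio   # mask of PCs that explain at least that much variance
np.sum(np.abs(pca.components_[keep]), axis=0)       # sum the absolute loadings over the retained PCs only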
Here is a visual example that highlights why a plain sum doesn't work as desired.
Given 3 observations of 20 features (each observation can be visualized as a 5x4 heatmap, and positions like (2,1) below refer to cells of that grid):
>>> print(X.T)
[[2 1 1 1 1 1 1 1 1 4 1 1 1 4 1 1 1 1 1 2]
[1 1 1 1 1 1 1 1 1 4 1 1 1 6 3 1 1 1 1 2]
[1 1 1 2 1 1 1 1 1 5 2 1 1 5 1 1 1 1 1 2]]
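To reproduce this, X can be rebuilt by transposing the printed values:
>>> import numpy as np
>>> X = np.array([[2, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 1, 2],
...               [1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 6, 3, 1, 1, 1, 1, 2],
...               [1, 1, 1, 2, 1, 1, 1, 1, 1, 5, 2, 1, 1, 5, 1, 1, 1, 1, 1, 2]]).T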
These are the resulting PCs (each row of pca.components_ can be viewed as a 5x4 heatmap):
>>> pca = PCA(n_components=None, whiten=True, svd_solver='full').fit(X.T)
Note that PC3 has high magnitude at (2,1), but if we check its explained variance, it offers ~0 contribution:
>>> pca.explained_variance_ratio_
array([0.6638886943392722, 0.3361113056607279, 2.2971091700327738e-32])
This causes a feature selection discrepancy when summing the unscaled PCs (left) vs summing the PCs scaled by their explained variance ratios (right):
>>> unscaled = np.sum(np.abs(pca.components_), axis=0)
>>> scaled = np.sum(pca.explained_variance_ratio_[:, None] * np.abs(pca.components_), axis=0)
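A minimal matplotlib sketch along these lines can regenerate the two heatmaps referred to as "left" and "right" below (assuming the 20 features map row-major onto the 5x4 grid):
>>> import matplotlib.pyplot as plt
>>> fig, (ax1, ax2) = plt.subplots(1, 2)
>>> ax1.imshow(unscaled.reshape(5, 4))
>>> ax1.set_title('unscaled sum')
>>> ax2.imshow(scaled.reshape(5, 4))
>>> ax2.set_title('scaled sum')
>>> plt.show()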
With the unscaled sum (left), the meaningless PC3 is still given 33% weight. This causes (2,1) to be considered the most important feature, but if we look back at the original data, (2,1) offers low discrimination between observations.
With the scaled sum (right), PC1 and PC2 have 66% and 33% weight respectively. Now (3,1) and (3,2) are the most important features, which actually tracks with the original data.