I am running PCA on my data (~250 features) and see that all points are clustered in 3 blobs.
Is it possible to see which of the 250 features contribute most to the outcome? If so, how?
(using the Scikit-learn implementation)
The importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (the higher the magnitude, the higher the importance). In this example, we can conclude that features 1, 3 and 4 are the most important for PC1. Similarly, we can state that features 2 and then 1 are the most important for PC2.
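In scikit-learn those magnitudes are stored in pca.components_. Here is a minimal sketch of reading them off; the data matrix X below is made up purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

# Made-up stand-in for the real ~250-feature dataset.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 250))

pca = PCA(n_components=2).fit(X)

# pca.components_ has shape (n_components, n_features); each row is an
# eigenvector whose entries are the loadings of the original features.
for i, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1][:5]
    print(f"PC{i + 1}: features with the largest absolute loadings: {top}")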
The contribution is a scaled version of the squared correlation between variables and component axes (or, geometrically, the squared cosine); it is used to assess the quality of the representation of the variables on the principal component, and it is computed as cos²(variable, axis) × 100 / total cos² of the component.
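A rough way to get those contributions from a fitted scikit-learn PCA is to rescale the squared loadings to percentages per component; this is only a sketch under the assumption of standardized data, and the variable names are my own:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up, standardized data so that loadings relate directly to correlations.
rng = np.random.RandomState(0)
X = StandardScaler().fit_transform(rng.normal(size=(500, 250)))

pca = PCA(n_components=2).fit(X)

# Squared loadings rescaled to percentages per component; since each row of
# pca.components_ is a unit-norm eigenvector, the eigenvalue factors in the
# cos² definition cancel and each row of contrib sums to 100.
loadings_sq = pca.components_ ** 2
contrib = 100 * loadings_sq / loadings_sq.sum(axis=1, keepdims=True)
print(contrib.sum(axis=1))  # [100. 100.]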
Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we'd need about 20 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present in multiple observations.
If our sole intention in doing PCA is data visualization, the best number of components is 2 or 3. If we really want to reduce the size of the dataset, the best number of principal components is much smaller than the number of variables in the original dataset.
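As a sketch of that explained-variance check in scikit-learn (X again stands in for your own data):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder data; replace X with the real ~250-feature matrix.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 250))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components retaining at least 90% of the variance.
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_90)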
Let's see what Wikipedia says:
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
To see how 'influential' the vectors of the original space are in the smaller one, you have to project them as well, which is done by:

res = pca.transform(np.eye(D))

np.eye(n) creates an n x n identity matrix (ones on the diagonal, zeros elsewhere), so np.eye(D) represents your features in the original feature space, and res is the projection of those features into the lower-dimensional space. The interesting thing is that res is a D x d matrix where res[i][j] represents how much feature i contributes to component j.

Then you can sum each row over its d columns to get a D x 1 vector (call it contribution), where each contribution[i] is the total contribution of feature i. Sort it and you find the most contributing features :)
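Putting that together, a runnable sketch of the idea (X, D and d are placeholders for your own data and dimensions):

import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: D original features, projected down to d components.
rng = np.random.RandomState(0)
D, d = 250, 2
X = rng.normal(size=(500, D))

pca = PCA(n_components=d).fit(X)

# Project the D original basis vectors into the d-dimensional PCA space.
res = pca.transform(np.eye(D))            # shape (D, d)

# Total contribution of each feature, then sort to see the biggest first.
contribution = res.sum(axis=1)            # shape (D,)
ranking = np.argsort(contribution)[::-1]
print(ranking[:10])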
Not sure whether that's clear; I could add more information if needed.
Hope this helps, pltrdy