Recovering features names of explained_variance_ratio_ in PCA with sklearn

I ran a PCA with scikit-learn and I'm trying to recover which features it selected as relevant.

A classic example with the iris dataset:

    import pandas as pd
    from sklearn import datasets
    from sklearn.decomposition import PCA

    # load dataset
    iris = datasets.load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)

    # normalize data
    df_norm = (df - df.mean()) / df.std()

    # PCA
    pca = PCA(n_components=2)
    pca.fit_transform(df_norm.values)
    print(pca.explained_variance_ratio_)

This returns

    In [42]: pca.explained_variance_ratio_
    Out[42]: array([ 0.72770452,  0.23030523])

How can I recover which two features account for these two explained variances in the dataset? Said differently, how can I get the indices of these features in iris.feature_names?

    In [47]: print(iris.feature_names)
    ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Thanks in advance for your help.

asked Apr 10 '14 09:04

mazieres

2 Answers

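This information is in the components_ attribute of the fitted PCA object. As described in the documentation, pca.components_ is an array of shape [n_components, n_features], so wrapping it in a DataFrame with the feature names as columns shows how each component is linearly related to the original features.

Note: each coefficient is the weight (loading) of a feature in a component. For standardized data it is, within a component, proportional to the correlation between that feature and the component, but it is not the correlation itself.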

    import pandas as pd
    from sklearn import datasets, preprocessing
    from sklearn.decomposition import PCA

    # load dataset
    iris = datasets.load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)

    # normalize data
    data_scaled = pd.DataFrame(preprocessing.scale(df), columns=df.columns)

    # PCA
    pca = PCA(n_components=2)
    pca.fit_transform(data_scaled)

    # Dump components relations with features:
    print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))

          sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
    PC-1           0.522372         -0.263355           0.581254          0.565611
    PC-2          -0.372318         -0.925556          -0.021095         -0.065416
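A minimal sketch, assuming the same standardized iris data, of how the weights in components_ relate to correlations. For features scaled to unit variance, the correlation between a component and a feature should equal the weight multiplied by the standard deviation of that component's scores. The names `scores`, `corr_direct` and `corr_from_weights` are illustrative and not part of the answer above.

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

# Standardize: zero mean and unit (population) variance per feature.
X = preprocessing.scale(datasets.load_iris().data)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # shape (n_samples, 2)

# Correlation between each component's scores and each feature, computed directly.
corr_direct = np.array([
    [np.corrcoef(scores[:, k], X[:, j])[0, 1] for j in range(X.shape[1])]
    for k in range(pca.n_components_)
])

# The same quantity derived from the weights: weight * std of the component's scores.
corr_from_weights = pca.components_ * scores.std(axis=0)[:, np.newaxis]

print(np.allclose(corr_direct, corr_from_weights))
```

So the weights and the correlations rank the features of a given component in the same order, but their magnitudes differ by a per-component factor.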

IMPORTANT: As a side comment, note that the sign of a principal component does not affect its interpretation, since flipping the sign does not change the variance captured by that component. Only the relative signs of the features within a component matter. In fact, if you run the PCA code again, you might get components with all signs inverted. For an intuition about this, think of a vector and its negative in 3-D space: both describe the same axis. Check this post for further reference.
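The sign argument can be checked numerically. The sketch below, assuming the standardized iris data used above, projects the data onto the first principal axis and onto its negative and compares the variance of the two projections; `var_pos` and `var_neg` are illustrative names.

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

X = preprocessing.scale(datasets.load_iris().data)
pca = PCA(n_components=2).fit(X)

v = pca.components_[0]       # first principal axis (unit vector)

var_pos = np.var(X @ v)      # variance of the data projected onto v
var_neg = np.var(X @ (-v))   # variance of the data projected onto -v

print(np.isclose(var_pos, var_neg))
```

The projections themselves are mirror images of each other, which is why a plot of the scores may appear flipped between runs or library versions while the explained variance stays the same.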

answered Oct 07 '22 10:10

Rafa

Edit: as others have commented, you can get the same values from the .components_ attribute.

Each principal component is a linear combination of the original variables:

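$$PC^{j} = \beta_{1}^{j} X_{1} + \beta_{2}^{j} X_{2} + \dots + \beta_{p}^{j} X_{p}$$

where the $X_i$ are the original variables and the $\beta_i^{j}$ are the corresponding weights, or so-called coefficients, of the $j$-th principal component.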

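To obtain the weights, you can simply pass an identity matrix to the transform method. This works here because the data were centered: transform subtracts the fitted mean (pca.mean_) before projecting, and that mean is numerically zero for normalized data: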

    >>> import numpy as np
    >>> i = np.identity(df.shape[1])  # identity matrix
    >>> i
    array([[ 1.,  0.,  0.,  0.],
           [ 0.,  1.,  0.,  0.],
           [ 0.,  0.,  1.,  0.],
           [ 0.,  0.,  0.,  1.]])

    >>> coef = pca.transform(i)
    >>> coef
    array([[ 0.5224, -0.3723],
           [-0.2634, -0.9256],
           [ 0.5813, -0.0211],
           [ 0.5656, -0.0654]])
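The identity-matrix trick relies on the data having zero mean. A hedged sketch of why: it fits one PCA on standardized data and one on the raw iris measurements, then compares transform(identity) with components_.T in both cases. The variable names (`eye`, `pca_scaled`, `pca_raw`, and so on) are illustrative, and the sketch assumes the default whiten=False.

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

X_raw = datasets.load_iris().data
X_scaled = preprocessing.scale(X_raw)
eye = np.identity(X_raw.shape[1])

# Centered data: mean_ is numerically zero, so transform(I) reproduces components_.T.
pca_scaled = PCA(n_components=2).fit(X_scaled)
matches_scaled = np.allclose(pca_scaled.transform(eye), pca_scaled.components_.T)

# Raw data: transform subtracts mean_ first, so transform(I) is shifted by a
# constant offset, while components_ still holds the weights.
pca_raw = PCA(n_components=2).fit(X_raw)
matches_raw = np.allclose(pca_raw.transform(eye), pca_raw.components_.T)
offset_explains_gap = np.allclose(
    pca_raw.transform(eye),
    (eye - pca_raw.mean_) @ pca_raw.components_.T,
)

print(matches_scaled, matches_raw, offset_explains_gap)
```

Reading components_ directly avoids this dependence on centering, which is one reason to prefer it over the transform call.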

Each column of the coef matrix above holds the weights of the linear combination that yields the corresponding principal component:

    >>> pd.DataFrame(coef, columns=['PC-1', 'PC-2'], index=df.columns)
                        PC-1   PC-2
    sepal length (cm)  0.522 -0.372
    sepal width (cm)  -0.263 -0.926
    petal length (cm)  0.581 -0.021
    petal width (cm)   0.566 -0.065

    [4 rows x 2 columns]

For example, the table above shows that the second principal component (PC-2) is mostly aligned with sepal width, which has the largest weight in absolute value (0.926).
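Reading off the dominant feature can also be done programmatically. The sketch below is one possible way to get, for each component, the index into iris.feature_names of the feature with the largest absolute weight, plus a full ranking for PC-2. Keep in mind that PCA does not select features: every component mixes all four, so this only reports which feature contributes most. The names `top_idx`, `top_names` and `pc2_ranking` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = preprocessing.scale(iris.data)
pca = PCA(n_components=2).fit(X)

# Weights with features as rows and components as columns.
weights = pd.DataFrame(pca.components_.T,
                       index=iris.feature_names,
                       columns=['PC-1', 'PC-2'])

# Index (into iris.feature_names) of the largest |weight| for each component.
top_idx = np.abs(pca.components_).argmax(axis=1)
top_names = [iris.feature_names[i] for i in top_idx]

# Features ranked by absolute weight in PC-2, largest first.
pc2_ranking = weights['PC-2'].abs().sort_values(ascending=False)

print(top_idx, top_names)
print(pc2_ranking)
```

Taking absolute values before ranking matters because the sign of a whole component is arbitrary.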

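You can also confirm that each coefficient vector has norm 1.0. This holds because the principal axes are unit-length directions, whether or not the data were normalized; the variance of each principal component is given by pca.explained_variance_, not by the norm of its coefficients: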

    >>> np.linalg.norm(coef, axis=0)
    array([ 1.,  1.])
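As a quick check of both quantities, the sketch below (assuming the standardized iris data) computes the norm of each coefficient vector and the sample variance of each component's scores, and compares the latter with pca.explained_variance_. The names `norms` and `score_var` are illustrative.

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

X = preprocessing.scale(datasets.load_iris().data)
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Norm of each coefficient vector (one row of components_ per component).
norms = np.linalg.norm(pca.components_, axis=1)

# Sample variance (ddof=1) of each component's scores.
score_var = scores.var(axis=0, ddof=1)

print(np.allclose(norms, 1.0))
print(np.allclose(score_var, pca.explained_variance_))
```

The coefficient vectors have unit norm, while the score variances match explained_variance_ and are generally different from 1.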

One may also confirm that the principal components can be calculated as the dot product of the above coefficients and the original variables:

    >>> np.allclose(df_norm.values.dot(coef), pca.fit_transform(df_norm.values))
    True

Note that we need to use numpy.allclose instead of the regular equality operator, because of floating-point rounding error.
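A small illustration of the difference, unrelated to PCA itself: two values that are mathematically equal can differ in their last bits, so exact comparison fails while a tolerance-based comparison succeeds. The arrays `a` and `b` are made up for this example.

```python
import numpy as np

a = np.array([0.1 + 0.2])
b = np.array([0.3])

exact = bool((a == b).all())   # exact, bit-for-bit comparison
close = np.allclose(a, b)      # comparison within numpy's default tolerances

print(exact, close)
```

The default tolerances of numpy.allclose (rtol and atol) can be tightened or loosened if the compared values are very small or very large.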

answered Oct 07 '22 10:10

behzad.nouri

Tags: python, machine-learning, scikit-learn, pca
