 

How to map the results of Principal Component Analysis back to the actual features that were fed into the model?

When I run the code below, I see 'pca.explained_variance_ratio_' and a bar chart showing the proportion of variance explained by each principal component.

import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)

x = df.iloc[:,2:]

from sklearn.preprocessing import StandardScaler

pca = PCA(n_components=11)
principalComponents = pca.fit_transform(df)


# Plotting the variances for each PC
PC = range(1, pca.n_components_+1)
plt.bar(PC, pca.explained_variance_ratio_, color='gold')
plt.xlabel('Principal Components')
plt.ylabel('Variance %')
plt.xticks(PC)
plt.show()

[Image: bar chart of the variance explained by each principal component]

How can I map PCA 1 and 2 back to the original features in the data frame?

asked Nov 01 '25 at 20:11 by ASH


1 Answer

Each PC is a linear combination of your variables. For example, PC1 will be (where X1, X2, ..., Xp are the original variables):

PC1 = w11*X1 + w12*X2 + ... + w1p*Xp

So w11, w12, ... would be your loadings, and they represent each feature's influence on the associated PC. Basically, they show how strongly each feature is correlated with the PC. See this post or maybe something like this blog for an explanation with sklearn.
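To make the linear-combination idea concrete, here is a minimal, self-contained sketch on hypothetical two-feature toy data (not the mtcars data), showing that the PC scores are exactly these weighted sums of the features:

# Sketch on hypothetical toy data: two correlated features
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
X_toy[:, 1] += 0.8 * X_toy[:, 0]      # make the two features correlated
X_toy -= X_toy.mean(axis=0)           # centre the columns

toy = PCA(n_components=2).fit(X_toy)
w = toy.components_                   # rows = PCs, columns = feature weights

# The PC1 score for each observation is w11*X1 + w12*X2 -- a weighted sum
manual_scores = X_toy @ w.T
assert np.allclose(manual_scores, toy.transform(X_toy))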

In your example, you did not scale your data before PCA, so the feature with the largest magnitude would dominate the loadings. So let's do this properly:

import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
X = StandardScaler().fit_transform(df)
pca = PCA(n_components=11)
pca.fit(X)

# Plotting the variances for each PC
PC = range(1, pca.n_components_+1)
plt.bar(PC, pca.explained_variance_ratio_, color='gold')
plt.xlabel('Principal Components')
plt.ylabel('Variance %')
plt.xticks(PC)
plt.show()

[Image: bar chart of the variance explained by each principal component, after scaling]
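If you prefer numbers to the plot, the cumulative ratio (continuing from the snippet above) tells you how much variance the first k components capture together:

# Cumulative share of variance explained by the first k components
print(pca.explained_variance_ratio_.cumsum())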

These are the loadings:

PCnames = ['PC'+str(i+1) for i in range(pca.n_components_)]
# pca.components_ has shape (n_components, n_features), so transpose it
# to get one row per original feature and one column per PC
Loadings = pd.DataFrame(pca.components_.T, columns=PCnames, index=df.columns)

Loadings.iloc[:,:2]

                PC1        PC2
    mpg      0.362531   -0.373916
    cyl      0.016124    0.043744
    disp    -0.225744   -0.175311
    hp      -0.022540   -0.002592
    drat     0.102845    0.058484
    wt      -0.108797    0.168554
    qsec     0.367724    0.057278
    vs       0.754091    0.230825
    am      -0.235702   -0.054035
    gear     0.139285   -0.846419
    carb     0.124896    0.140695 
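As a quick way to read this table, you can pull out the feature with the largest absolute loading on each component (a small sketch continuing from the Loadings frame above):

# Feature with the largest |loading| per PC
print(Loadings.abs().idxmax())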

Roughly speaking, this means observations with a high vs will have higher scores on the first component (PC1), while those with a high am will have lower scores. And you can check each feature's influence:

Loadings["PC1"].sort_values().plot.barh()

[Image: horizontal bar chart of PC1 loadings per feature, sorted by value]
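Finally, to tie this back to the original question: the loadings are exactly the map between feature space and PC space. A short sketch, continuing from the code above, that reproduces the PC scores from the loadings and maps them back again:

# Scores: project the (centred) data onto the loadings
scores = pca.transform(X)
manual = (X - pca.mean_) @ Loadings.values    # same thing, done by hand
assert np.allclose(scores, manual)

# Going the other way: with all 11 components kept, the original
# (standardised) data is recovered exactly from the scores
X_back = pca.inverse_transform(scores)
assert np.allclose(X_back, X)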

answered Nov 03 '25 at 10:11 by StupidWolf


