PCA projection and reconstruction in scikit-learn

Tags:

I can perform PCA in scikit by code below: X_train has 279180 rows and 104 columns.

from sklearn.decomposition import PCA pca = PCA(n_components=30) X_train_pca = pca.fit_transform(X_train)

Now, when I want to project the eigenvectors onto feature space, I must do following:

""" Projection """ comp = pca.components_ #30x104 com_tr = np.transpose(pca.components_) #104x30 proj = np.dot(X_train,com_tr) #279180x104 * 104x30 = 297180x30

But I am hesitating with this step, because Scikit documentation says:

components_: array, [n_components, n_features]

Principal axes in feature space, representing the directions of maximum variance in the data.

It seems to me, that it is already projected, but when I checked the source code, it returns only the eigenvectors.

What is the right way how to project it?

Ultimately, I am aiming to calculate the MSE of reconstruction.

""" Reconstruct """ recon = np.dot(proj,comp) #297180x30 * 30x104 = 279180x104  """  MSE Error """ print "MSE = %.6G" %(np.mean((X_train - recon)**2))

964

asked Apr 12 '16 07:04

HonzaB

2 Answers

You can do

proj = pca.inverse_transform(X_train_pca)

That way you do not have to worry about how to do the multiplications.

What you obtain after pca.fit_transform or pca.transform are what is usually called the "loadings" for each sample, meaning how much of each component you need to describe it best using a linear combination of the components_ (the principal axes in feature space).

The projection you are aiming at is back in the original signal space. This means that you need to go back into signal space using the components and the loadings.

So there are three steps to disambiguate here. Here you have, step by step, what you can do using the PCA object and how it is actually calculated:

pca.fit estimates the components (using an SVD on the centered Xtrain):

 from sklearn.decomposition import PCA  import numpy as np  from numpy.testing import assert_array_almost_equal   #Should this variable be X_train instead of Xtrain?  X_train = np.random.randn(100, 50)   pca = PCA(n_components=30)  pca.fit(X_train)   U, S, VT = np.linalg.svd(X_train - X_train.mean(0))   assert_array_almost_equal(VT[:30], pca.components_)

pca.transform calculates the loadings as you describe

 X_train_pca = pca.transform(X_train)   X_train_pca2 = (X_train - pca.mean_).dot(pca.components_.T)   assert_array_almost_equal(X_train_pca, X_train_pca2)

pca.inverse_transform obtains the projection onto components in signal space you are interested in

 X_projected = pca.inverse_transform(X_train_pca)  X_projected2 = X_train_pca.dot(pca.components_) + pca.mean_   assert_array_almost_equal(X_projected, X_projected2)

You can now evaluate the projection loss

loss = np.sum((X_train - X_projected) ** 2, axis=1).mean()

105

answered Sep 20 '22 18:09

eickenberg

Adding on @eickenberg's post, here is how to do the pca reconstruction of digits' images:

import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import load_digits from sklearn import decomposition  n_components = 10 image_shape = (8, 8)  digits = load_digits() digits = digits.data  n_samples, n_features = digits.shape estimator = decomposition.PCA(n_components=n_components, svd_solver='randomized', whiten=True) digits_recons = estimator.inverse_transform(estimator.fit_transform(digits))  # show 5 randomly chosen digits and their PCA reconstructions with 10 dominant eigenvectors indices = np.random.choice(n_samples, 5, replace=False) plt.figure(figsize=(5,2)) for i in range(len(indices)):     plt.subplot(1,5,i+1), plt.imshow(np.reshape(digits[indices[i],:], image_shape)), plt.axis('off') plt.suptitle('Original', size=25) plt.show() plt.figure(figsize=(5,2)) for i in range(len(indices)):     plt.subplot(1,5,i+1), plt.imshow(np.reshape(digits_recons[indices[i],:], image_shape)), plt.axis('off') plt.suptitle('PCA reconstructed'.format(n_components), size=25) plt.show()

enter image description here

answered Sep 22 '22 18:09

2 revs

Related questions
                            
                                OrderedDict performance (compared to deque)
                            
                                Python ftplib - specify port
                            
                                Should I unittest private/protected method
                            
                                Merge dataframes in a dictionary
                            
                                Generating Ngrams (Unigrams,Bigrams etc) from a large corpus of .txt files and their Frequency
                            
                                Random Forest with GridSearchCV - Error on param_grid
                            
                                CX_Oracle - import data from Oracle to Pandas dataframe
                            
                                Numpy equivalent of if/else without loop
                            
                                How to copy a Python class instance if deepcopy() does not work?
                            
                                How to use ``xlrd.xldate_as_tuple()``
                            
                                What is the difference between django classonlymethod and python classmethod?
                            
                                How do I set color to Rectangle in Matplotlib?
                            
                                Read a local file in django
                            
                                pySerial write() won't take my string
                            
                                Excluding directory
                            
                                What is the g object in this Flask code?
                            
                                Explicitly declaring variable as unused in Python/PyCharm
                            
                                How does the min/max function on a nested list work?
                            
                                PySpark and broadcast join example
                            
                                Can we send message to user in slack using python script?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

PCA projection and reconstruction in scikit-learn

Tags:

python

machine-learning

scikit-learn

pca

HonzaB

People also ask

2 Answers

eickenberg

2 revs

Recent Activity

Donate For Us