Usually the PCA transform is easily inverted:
import numpy as np
from sklearn import decomposition
x = np.zeros((500, 10))
x[:, :5] = np.random.rand(500, 5)
x[:, 5:] = x[:, :5] # so that using PCA would make sense
p = decomposition.PCA()
p.fit(x)
a = x[5, :]
print(p.inverse_transform(p.transform(a)) - a)  # this yields small numbers (about 10**-16)
Now, if we add the whiten=True parameter, the result is entirely different:
p = decomposition.PCA(whiten=True)
p.fit(x)
a = x[5, :]
print(p.inverse_transform(p.transform(a)) - a)  # now yields numbers about 10**15
So, as I didn't find any other methods that would do the trick, I wonder how it is possible to obtain the original value of a? Or is it even possible at all? Thanks a lot for any help.
Inverse transformation to reconstruct the data: after compressing the data by reducing its dimensionality with PCA, we can reconstruct the data in its original dimension by inverting the transformation. There will be some information loss, so we can't reconstruct the original data 100%.
When you call transform you're asking sklearn to actually do the projection. That is, you are asking it to project each row of your data into the vector space that was learned when fit was called.
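In other words, for a plain (non-whitened) PCA the projection is just a matrix product with the learned components. A minimal sketch with made-up data (not part of the original discussion):
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.RandomState(0)
X = rng.rand(100, 6)
pca = PCA().fit(X)  # non-whitened PCA
projected = pca.transform(X)
manual = np.dot(X - pca.mean_, pca.components_.T)  # center, then project onto the components
print(np.allclose(projected, manual))  # True: transform is exactly this projection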
— PCA-whitening is the unique sphering procedure that maximizes the integration, or compression, of all components of the original vector X in each component of the sphered vector X' based on the cross-covariance as underlying measure. That is: if you plan on reducing the dimension of your data, use PCA-whitening.
pca = PCA(n_components=k)  # k = the number of principal components to keep
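To make the reconstruction-loss note above concrete, here is a small sketch (made-up data, no whitening) contrasting a full PCA with a truncated one:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
full = PCA().fit(X)                   # keep all 10 components
reduced = PCA(n_components=3).fit(X)  # keep only 3 components
X_full = full.inverse_transform(full.transform(X))
X_reduced = reduced.inverse_transform(reduced.transform(X))
print(np.allclose(X, X_full))       # True: nothing is lost when all components are kept
print(np.abs(X - X_reduced).max())  # > 0: error introduced by dropping components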
This behaviour is admittedly potentially weird, but it is nevertheless documented in the docstrings of the relevant functions.
The class docstring of PCA says the following about whiten:
whiten : bool, optional
When True (False by default) the `components_` vectors are divided
by n_samples times singular values to ensure uncorrelated outputs
with unit component-wise variances.
Whitening will remove some information from the transformed signal
(the relative variance scales of the components) but can sometime
improve the predictive accuracy of the downstream estimators by
making there data respect some hard-wired assumptions.
The code and docstring of PCA.inverse_transform say:
def inverse_transform(self, X):
    """Transform data back to its original space, i.e.,
    return an input X_original whose transform would be X

    Parameters
    ----------
    X : array-like, shape (n_samples, n_components)
        New data, where n_samples is the number of samples
        and n_components is the number of components.

    Returns
    -------
    X_original array-like, shape (n_samples, n_features)

    Notes
    -----
    If whitening is enabled, inverse_transform does not compute the
    exact inverse operation as transform.
    """
    return np.dot(X, self.components_) + self.mean_
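Note that this formula can only be an exact inverse when components_ has orthonormal rows, which is the case without whitening. A quick check (a sketch with made-up data, not from the original answer):
import numpy as np
from sklearn.decomposition import PCA
pca = PCA().fit(np.random.rand(50, 4))
# The rows of components_ are orthonormal, so multiplying by components_
# undoes the earlier multiplication by components_.T in transform.
print(np.allclose(np.dot(pca.components_, pca.components_.T), np.eye(4)))  # True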
Now take a look at what happens when whiten=True in the function PCA._fit:
if self.whiten:
    self.components_ = V / S[:, np.newaxis] * np.sqrt(n_samples)
else:
    self.components_ = V
where S are singular values and V are singular vectors. By definition, whitening levels the spectrum, essentially setting all of the eigenvalues of the covariance matrix to 1.
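To see that levelling in action, here is a short sketch (made-up data, an addition to the answer) checking that the whitened output has an approximately identity covariance matrix:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.RandomState(0)
X = rng.randn(300, 5) * np.array([1., 2., 3., 4., 5.])  # features with very unequal variances
Z = PCA(whiten=True).fit_transform(X)
print(np.round(np.cov(Z, rowvar=False), 2))  # approximately the identity matrix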
In order to finally answer your question: the PCA object of sklearn.decomposition does not allow reconstructing original data from the whitened matrix, because the singular values of the centered data / the eigenvalues of the covariance matrix are garbage collected after the function PCA._fit.
However, if you obtain the singular values S manually, you will be able to multiply them back and arrive back at your original data.
Try this:
import numpy as np
rng = np.random.RandomState(42)
n_samples_train, n_features = 40, 10
n_samples_test = 20
X_train = rng.randn(n_samples_train, n_features)
X_test = rng.randn(n_samples_test, n_features)
from sklearn.decomposition import PCA
pca = PCA(whiten=True)
pca.fit(X_train)
X_train_mean = X_train.mean(0)
X_train_centered = X_train - X_train_mean
U, S, VT = np.linalg.svd(X_train_centered, full_matrices=False)
components = VT / S[:, np.newaxis] * np.sqrt(n_samples_train)
from numpy.testing import assert_array_almost_equal
# These assertions will raise an error if the arrays aren't equal
assert_array_almost_equal(components, pca.components_) # we have successfully
# calculated whitened components
transformed = pca.transform(X_test)
inverse_transformed = transformed.dot(S[:, np.newaxis] ** 2 * pca.components_ /
n_samples_train) + X_train_mean
assert_array_almost_equal(inverse_transformed, X_test) # We have equality
As you can see from the line creating inverse_transformed, if you multiply the singular values back into the components, you can return to the original space.
As a matter of fact, the singular values S are actually hidden in the norms of the components, so there is no need to calculate an SVD alongside the PCA. Using the definitions above one can see:
S_recalculated = 1. / np.sqrt((pca.components_ ** 2).sum(axis=1) / n_samples_train)
assert_array_almost_equal(S, S_recalculated)
Conclusion: By obtaining the singular values of the centered data matrix, we are able to undo the whitening and transform back to the original space. However, this functionality is not natively implemented in the PCA object.
Remedy: Without modifying the code of scikit-learn (which may be done officially if considered useful by the community), the solution you are looking for is this (I will now use your code and variable names; please check whether this works for you):
transformed_a = p.transform(a)
singular_values = 1. / np.sqrt((p.components_ ** 2).sum(axis=1) / len(x))
inverse_transformed = np.dot(transformed_a, singular_values[:, np.newaxis] ** 2 *
                             p.components_ / len(x)) + p.mean_
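A quick sanity check (an addition, not part of the original answer): with the sklearn version this remedy targets, where the whitened components_ carry the sqrt(n_samples) / S scaling, the round trip should now close.
print(np.allclose(inverse_transformed, a))  # should print True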
(IMHO the inverse_transform function of any estimator should go back as close as possible to the original data. In this case it would not cost much to also store the singular values explicitly, so maybe this functionality should actually be added to sklearn.)
EDIT: The singular values of the centered matrix are not garbage collected as initially thought. As a matter of fact, they are stored in pca.explained_variance_ and can be used to unwhiten. See comments.
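Following that edit, here is a minimal, self-contained sketch of unwhitening via pca.explained_variance_. It assumes the sklearn version discussed in this answer, where a whitened fit sets components_ = V / S * sqrt(n_samples) and explained_variance_ = S**2 / n_samples:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.RandomState(42)
X_train = rng.randn(40, 10)
X_test = rng.randn(20, 10)
n_samples_train = X_train.shape[0]
pca = PCA(whiten=True).fit(X_train)
Z = pca.transform(X_test)
# Recover the singular values from the stored explained variances ...
S = np.sqrt(pca.explained_variance_ * n_samples_train)
# ... and multiply them back, exactly as in the example above.
X_back = np.dot(Z, S[:, np.newaxis] ** 2 * pca.components_ / n_samples_train) + pca.mean_
print(np.allclose(X_back, X_test))  # True for that sklearn version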
self.components_ originally contains the eigenvectors, which satisfy
>>> np.allclose(self.components_.T, np.linalg.inv(self.components_))
True
To project onto these components (transform in sklearn), PCA subtracts self.mean_ and multiplies by self.components_.T, as
Y = np.dot(X - self.mean_, self.components_.T)
=> Y = (X - mean) * V.T # rewritten for simple notation
where X contains the samples, mean is the mean of the training samples, and V contains the principal components.
Then the reconstruction (inverse_transform in sklearn) is as follows (to get X back from Y):
Y = (X - mean) * V.T
=> Y*inv(V.T) = X - mean
=> Y*V = X - mean # inv(V.T) = V
=> X = Y*V + mean
=> Xrec = np.dot(Y, self.components_) + self.mean_  # what inverse_transform computes (its argument is named X in the source)
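A quick numerical check of this derivation (a sketch with made-up data, whiten=False):
import numpy as np
from sklearn.decomposition import PCA
X = np.random.rand(100, 6)
p = PCA().fit(X)
Y = p.transform(X)                          # Y = (X - mean) * V.T
X_rec = np.dot(Y, p.components_) + p.mean_  # X = Y * V + mean
print(np.allclose(X_rec, X))                       # True
print(np.allclose(X_rec, p.inverse_transform(Y)))  # matches sklearn's inverse_transform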
The problem is that self.components_ of a whitened PCA does not satisfy
>>> np.allclose(self.components_.T, np.linalg.inv(self.components_))
False
You can derive the reason why from the code by @eickenberg.
So, you need to modify sklearn.decomposition.pca so that the code retains the reconstruction matrix. self.components_ of a whitened PCA is
self.components_ = V / S[:, np.newaxis] * np.sqrt(n_samples)
So we can assign the reconstruction matrix as
self.recons_ = V * S[:, np.newaxis] / np.sqrt(n_samples)
When inverse_transform is called, we will return the result derived from this matrix as
if self.whiten:
    return np.dot(X, self.recons_) + self.mean_
That's it. Let's test.
>>> p = decomposition.PCA(whiten=True)
>>> p.fit(x)
>>> np.allclose(p.inverse_transform(p.transform(a)), a)
True
Sorry for my English. Please improve this post; I am not sure these expressions are correct.
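If you would rather not edit the library source, the same idea can be packaged as a small subclass. This is only a sketch (the class name WhitenedPCA is made up) and it assumes the sklearn version quoted above, where a whitened fit sets components_ = V / S * sqrt(n_samples) and explained_variance_ = S**2 / n_samples, so that the reconstruction matrix V * S / np.sqrt(n_samples) equals components_ * explained_variance_[:, np.newaxis]:
import numpy as np
from sklearn.decomposition import PCA

class WhitenedPCA(PCA):
    """PCA whose inverse_transform also undoes whitening (hypothetical sketch)."""
    def fit(self, X, y=None):
        super(WhitenedPCA, self).fit(X, y)
        if self.whiten:
            # recons_ = V * S / sqrt(n_samples), expressed via stored attributes
            self.recons_ = self.components_ * self.explained_variance_[:, np.newaxis]
        return self
    def inverse_transform(self, X):
        if self.whiten:
            return np.dot(X, self.recons_) + self.mean_
        return super(WhitenedPCA, self).inverse_transform(X)
With this, p = WhitenedPCA(whiten=True); p.fit(x); np.allclose(p.inverse_transform(p.transform(a)), a) should again print True.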