scikit-learn PCA: matrix transformation produces PC estimates with flipped signs

I'm using scikit-learn to perform PCA on this dataset. The scikit-learn documentation states that

Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.

The problem is that I don't think I'm using different estimator objects, yet the signs of some of my PCs are flipped compared to the results from SAS's PROC PRINCOMP procedure.

For the first observation in the dataset, the SAS PCs are:

PC1      PC2      PC3       PC4      PC5
2.0508   1.9600   -0.1663   0.2965   -0.0121

From scikit-learn, I get the following (which are very close in magnitude):

PC1      PC2      PC3       PC4      PC5
-2.0536  -1.9627  -0.1666   -0.297   -0.0122

Here's what I'm doing:

import pandas as pd
import numpy  as np
from sklearn.decomposition import PCA

# read_csv already returns a DataFrame
frame = pd.read_csv('C:/mydata.csv')

# Some pandas evals, regressions, etc... that I'm not showing
# but not affecting the matrix

# Make sure we are working with the proper data -- drop the response variable
cols = [col for col in frame.columns if col not in ['response']]

# Separate out the data matrix from the response variable vector 
# into numpy arrays
frame2_X = frame[cols].values
frame2_y = frame['response'].values

# Standardize the values
X_means = np.mean(frame2_X,axis=0)
X_stds  = np.std(frame2_X,axis=0)

y_mean = np.mean(frame2_y)
y_std  = np.std(frame2_y)

# Cast to float so the in-place assignments below can't silently truncate
frame2_X_stdz = frame2_X.astype(np.float64, copy=True)
frame2_y_stdz = frame2_y.astype(np.float32, copy=True)

for (x,y), value in np.ndenumerate(frame2_X_stdz):
    frame2_X_stdz[x][y] = (value - X_means[y])/X_stds[y]

for index, value in enumerate(frame2_y_stdz):
    frame2_y_stdz[index] = (float(value) - y_mean)/y_std

# Show the first 5 elements of the first standardized column, to verify
print(frame2_X_stdz[:,0][:5])

# Show the first 5 elements of the standardized response vector, to verify
print(frame2_y_stdz[:5])

Those check out ok:

[ 0.9508 -0.5847 -0.2797 -0.4039 -0.598 ]
[ 1.0726 -0.5009 -0.0942 -0.1187 -0.8043]
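
(As an aside, the element-wise loops above can be replaced by vectorized numpy operations, which are faster and harder to get wrong. A minimal equivalent sketch, assuming frame2_X and frame2_y as defined above; note that np.std defaults to ddof=0, the population standard deviation, which is exactly what the loops compute:)

# Vectorized standardization via numpy broadcasting
frame2_X_stdz = (frame2_X - frame2_X.mean(axis=0)) / frame2_X.std(axis=0)
frame2_y_stdz = (frame2_y - frame2_y.mean()) / frame2_y.std()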

Continuing on...

# Create a PCA object
pca = PCA()
pca.fit(frame2_X_stdz)

# Create the matrix of PC estimates
pca.transform(frame2_X_stdz)

Here's the output of the last step:

Out[16]: array([[-2.0536, -1.9627, -0.1666, -0.297 , -0.0122],
       [ 1.382 , -0.382 , -0.5692, -0.0257, -0.0509],
       [ 0.4342,  0.611 ,  0.2701,  0.062 , -0.011 ],
       ..., 
       [ 0.0422,  0.7251, -0.1926,  0.0089,  0.0005],
       [ 1.4502, -0.7115, -0.0733,  0.0013, -0.0557],
       [ 0.258 ,  0.3684,  0.1873,  0.0403,  0.0042]])

I've also tried replacing the pca.fit() and pca.transform() calls with pca.fit_transform(), but I end up with the same results.
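
(For what it's worth, that matches expectations: on the same matrix, fit followed by transform and fit_transform should produce the same projection, up to per-component signs. A quick sanity check, assuming frame2_X_stdz from above and comparing absolute values to stay agnostic about signs:)

pca_a = PCA().fit(frame2_X_stdz)
scores_a = pca_a.transform(frame2_X_stdz)
scores_b = PCA().fit_transform(frame2_X_stdz)

# Same projection up to per-component sign, so absolute values must agree
print(np.allclose(np.abs(scores_a), np.abs(scores_b)))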

What am I doing wrong here that I'm getting PCs with the signs flipped?

asked Jan 14 '14 by Clay


1 Answer

You're doing nothing wrong.

What the documentation is warning you about is that repeated calls to fit may yield different principal components - not how they compare with another PCA implementation.

A flipped sign on a component doesn't make the result wrong - the result is right as long as it fulfills the definition (each component is chosen such that it captures the maximum amount of remaining variance in the data). As it stands, the projection you got is simply mirrored along some of the axes - it still fulfills the definition, and is thus correct.

If, beyond correctness, you're worried about consistency between implementations, you can simply multiply the affected components by -1 when necessary.
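
For example, here's a minimal sketch of one such convention: flip each component so that its largest-magnitude loading is positive, and flip the corresponding score column with it so the projection stays consistent. (Whether this matches the convention SAS applies internally is an assumption; verify against your PROC PRINCOMP output.)

import numpy as np
from sklearn.decomposition import PCA

pca = PCA()
scores = pca.fit_transform(frame2_X_stdz)   # frame2_X_stdz as in the question

# Sign of the largest-magnitude loading in each component (each row of components_)
max_idx = np.argmax(np.abs(pca.components_), axis=1)
signs = np.sign(pca.components_[np.arange(pca.components_.shape[0]), max_idx])

# Flip loadings and scores together: flipping both leaves the
# reconstruction scores @ components_ unchanged
components_fixed = pca.components_ * signs[:, np.newaxis]
scores_fixed = scores * signs

If you later call transform on new data with the same estimator, multiply its output by the same signs vector to keep the convention.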

answered Nov 06 '22 by loopbackbee