I'm using scikit-learn to perform PCA on this dataset. The scikit-learn documentation states that
Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.
The problem is that I don't think that I'm using different estimator objects, but the signs of some of my PCs are flipped, when compared to results in SAS's PROC PRINCOMP
procedure.
For the first observation in the dataset, the SAS PCs are:
PC1 PC2 PC3 PC4 PC5
2.0508 1.9600 -0.1663 0.2965 -0.0121
From scikit-learn, I get the following (which are very close in magnitude):
PC1 PC2 PC3 PC4 PC5
-2.0536 -1.9627 -0.1666 -0.297 -0.0122
Here's what I'm doing:
import pandas as pd
import numpy as np
from sklearn.decomposition.pca import PCA
sourcef = pd.read_csv('C:/mydata.csv')
frame = pd.DataFrame(sourcef)
# Some pandas evals, regressions, etc... that I'm not showing
# but not affecting the matrix
# Make sure we are working with the proper data -- drop the response variable
cols = [col for col in frame.columns if col not in ['response']]
# Separate out the data matrix from the response variable vector
# into numpy arrays
frame2_X = frame[cols].values
frame2_y = frame['response'].values
# Standardize the values
X_means = np.mean(frame2_X,axis=0)
X_stds = np.std(frame2_X,axis=0)
y_mean = np.mean(frame2_y)
y_std = np.std(frame2_y)
frame2_X_stdz = np.copy(frame2_X)
frame2_y_stdz = frame2_y.astype(numpy.float32, copy=True)
for (x,y), value in np.ndenumerate(frame2_X_stdz):
frame2_X_stdz[x][y] = (value - X_means[y])/X_stds[y]
for index, value in enumerate(frame2_y_stdz):
frame2_y_stdz[index] = (float(value) - y_mean)/y_std
# Show the first 5 elements of the standardized values, to verify
print frame2_X_stdz[:,0][:5]
# Show the first 5 lines from the standardized response vector, to verify
print frame2_y_stdz[:5]
Those check out ok:
[ 0.9508 -0.5847 -0.2797 -0.4039 -0.598 ]
[ 1.0726 -0.5009 -0.0942 -0.1187 -0.8043]
Continuing on...
# Create a PCA object
pca = PCA()
pca.fit(frame2_X_stdz)
# Create the matrix of PC estimates
pca.transform(frame2_X_stdz)
Here's the output of the last step:
Out[16]: array([[-2.0536, -1.9627, -0.1666, -0.297 , -0.0122],
[ 1.382 , -0.382 , -0.5692, -0.0257, -0.0509],
[ 0.4342, 0.611 , 0.2701, 0.062 , -0.011 ],
...,
[ 0.0422, 0.7251, -0.1926, 0.0089, 0.0005],
[ 1.4502, -0.7115, -0.0733, 0.0013, -0.0557],
[ 0.258 , 0.3684, 0.1873, 0.0403, 0.0042]])
I've tried it by replacing the pca.fit()
and pca.transform()
with pca.fit_transform()
, but I end up with the same results.
What am I doing wrong here that I'm getting PCs with the signs flipped?
When you call transform you're asking sklearn to actually do the projection. That is, you are asking it to project each row of your data into the vector space that was learned when fit was called.
2.3- Inverse transformation to reconstruct the data After compressing the data by reducing the dimensionality using PCA, we can reconstruct the data and return it to its original dimension by inverse the transformation, there will be an information losses, we cant reconstruct the original data 100% (ex.
In a sense, PCA is a kind of matrix factorization, since it decomposes a matrix X into WΣVT. However, matrix factorization is a very general term. Also, see this answer on math.
Returns: X_newarray-like of shape (n_samples, n_components) Projection of X in the first principal components, where n_samples is the number of samples and n_components is the number of the components.
You're doing nothing wrong.
What the documentation is warning you about is that repeated calls to fit
may yield different principal components - not how they relate to another PCA implementation.
Having a flipped sign on all components doesn't make the result wrong - the result is right as long as it fulfills the definition (each component is chosen such that it captures the maximum amount of variance in the data). As it stands, it seems the projection you got is simply mirrored - it still fulfills the definition, and is, thus, correct.
If, beneath correctness, you're worried about consistency between implementations, you can simply multiply the components by -1, when it's necessary.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With