I'm attempting to use sklearn's PCA functionality to reduce my data to 2 dimensions. However, I noticed that the result of fit_transform() does not match the result of multiplying the components_ attribute by my input data.
Why don't these match? Which result is correct?
def test_pca_fit_transform(self):
    import numpy as np
    from sklearn.decomposition import PCA
    # each column of input_data is an observation, each row is a dimension
    input_data = np.matrix([[11, 4, 9, 3, 2, 2], [7, 2, 8, 2, 0, 2], [3, 1, 2, 5, 2, 9]])
    # method 1: let sklearn project the (transposed) data
    pca = PCA(n_components=2)
    data2d = pca.fit_transform(input_data.T)
    # method 2: multiply the fitted components with the raw input data
    component_matrix = np.matrix(pca.components_)
    data2d_mult = (component_matrix * input_data).T
    np.testing.assert_almost_equal(data2d, data2d_mult)
    # FAILS!!!
The only step you are missing (which sklearn handles internally) is centering the data. PCA requires centered data; if yours is not centered, sklearn's PCA essentially does the following near the start of its fit method:
X -= X.mean(axis=0)
which centers the data along the first axis, i.e. subtracts each feature's mean.
To get the same result as sklearn (which is the correct one), you just need to center your data, either before fit or before applying your method 2.
Here is a working example:
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[11, 4, 9, 3, 2, 2], [7, 2, 8, 2, 0, 2], [3, 1, 2, 5, 2, 9]])
X = X.T.copy()
# PCA
pca = PCA(n_components=2)
data = pca.fit_transform(X)
# Your method 2 (no centering)
data2 = X.dot(pca.components_.T)
# Centering the data before method 2
data3 = X - X.mean(axis=0)
data3 = data3.dot(pca.components_.T)
# Compare
print(np.allclose(data, data2))  # prints False
print(np.allclose(data, data3))  # prints True
Note that I use .dot on plain numpy arrays instead of * on numpy matrix objects, since I prefer to avoid np.matrix whenever possible; the result is the same either way.
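As a further check, the fitted PCA object stores the mean it subtracted in its mean_ attribute, so you can reproduce fit_transform without recomputing the mean yourself. A minimal sketch, assuming the same X as above:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[11, 4, 9, 3, 2, 2], [7, 2, 8, 2, 0, 2], [3, 1, 2, 5, 2, 9]]).T
pca = PCA(n_components=2)
data = pca.fit_transform(X)
# Center with the mean sklearn stored during fit, then project onto the components.
data4 = (X - pca.mean_).dot(pca.components_.T)
print(np.allclose(data, data4))  # prints True

This mirrors what pca.transform does internally (with the default whiten=False).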