I wanted to implement PCA with a class similar to the one in sklearn.
My algorithm for finding PCA with k principal components is as follows:
import numpy as np

class MyPCA:
    def __init__(self, n_components):
        self.n_components = n_components

    def fit_transform(self, X):
        """
        Assumes observations in X are passed as rows of a numpy array.
        """
        # Translate the dataset so it's centered around 0
        translated_X = X - np.mean(X, axis=0)
        # Calculate the eigenvalues and eigenvectors of the covariance matrix
        e_values, e_vectors = np.linalg.eigh(np.cov(translated_X.T))
        # Sort eigenvalues and their eigenvectors in descending order
        e_ind_order = np.flip(e_values.argsort())
        e_values = e_values[e_ind_order]
        e_vectors = e_vectors[e_ind_order]
        # Save the first n_components eigenvectors as principal components
        principal_components = np.take(e_vectors, np.arange(self.n_components), axis=0)
        return np.matmul(translated_X, principal_components.T)
However, when run on the Iris dataset, this implementation produces vastly different results from sklearn's, and the results do not show that there are three distinct groups within the data:
from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_pca_results(pca_class, dataset, plot_title):
    X = dataset.data
    y = dataset.target
    y_names = dataset.target_names

    pca = pca_class(n_components=1)
    B = pca.fit_transform(X)
    B = np.concatenate([B, np.zeros_like(B)], 1)

    scatter = plt.scatter(B[:, 0], B[:, 1], c=y)
    scatter_objects, _ = scatter.legend_elements()
    plt.title(plot_title)
    plt.legend(scatter_objects, y_names, loc="lower left", title="Classes")
    plt.show()

dataset = datasets.load_iris()
plot_pca_results(MyPCA, dataset, "Iris - my PCA")
plot_pca_results(PCA, dataset, "Iris - Sklearn")
What might be the cause of such differences? Where does my approach or my calculation go wrong?
Comparing the two methods
The issue is with not standardizing the data and with how the eigenvectors (principal axes) are extracted. The following function compares the two methods.
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def pca_comparison(X, n_components, labels):
    """X: Standardized dataset, observations on rows
       n_components: dimensionality of the reduced space
       labels: targets, for visualization
    """

    # numpy
    # -----
    # calculate eigenvalues and eigenvectors of the covariance matrix
    X_cov = np.cov(X.T)
    e_values, e_vectors = np.linalg.eigh(X_cov)

    # Sort eigenvalues and their eigenvectors in descending order
    e_ind_order = np.flip(e_values.argsort())
    e_values = e_values[e_ind_order]
    e_vectors = e_vectors[:, e_ind_order]  # note that we have to re-order the columns, not rows

    # now we can project the dataset onto the eigenvectors (principal axes)
    prin_comp_evd = X @ e_vectors

    # sklearn
    # -------
    pca = PCA(n_components=n_components)
    prin_comp_sklearn = pca.fit_transform(X)

    # plotting
    if n_components == 3:
        fig = plt.figure(figsize=(10, 5))
        ax = fig.add_subplot(121, projection='3d')
        ax.scatter(prin_comp_sklearn[:, 0],
                   prin_comp_sklearn[:, 1],
                   prin_comp_sklearn[:, 2],
                   c=labels)
        ax.set_title("sklearn plot")

        ax = fig.add_subplot(122, projection='3d')
        ax.scatter(prin_comp_evd[:, 0],
                   prin_comp_evd[:, 1],
                   prin_comp_evd[:, 2],
                   c=labels)
        ax.set_title("PCA using EVD plot")

        fig.suptitle(f"Plots for reducing to {n_components}-D")
        plt.show()

    elif n_components == 2:
        fig, ax = plt.subplots(1, 2, figsize=(10, 5))
        ax[0].scatter(prin_comp_sklearn[:, 0],
                      prin_comp_sklearn[:, 1],
                      c=labels)
        ax[0].set_title("sklearn plot")

        ax[1].scatter(prin_comp_evd[:, 0],
                      prin_comp_evd[:, 1],
                      c=labels)
        ax[1].set_title("PCA using EVD plot")

        fig.suptitle(f"Plots for reducing to {n_components}-D")
        plt.show()

    elif n_components == 1:
        fig, ax = plt.subplots(1, 2, figsize=(10, 5))
        ax[0].scatter(prin_comp_sklearn[:, 0],
                      np.zeros_like(prin_comp_sklearn[:, 0]),
                      c=labels)
        ax[0].set_title("sklearn plot")

        ax[1].scatter(prin_comp_evd[:, 0],
                      np.zeros_like(prin_comp_evd[:, 0]),
                      c=labels)
        ax[1].set_title("PCA using EVD plot")

        fig.suptitle(f"Plots for reducing to {n_components}-D")
        plt.show()

    return prin_comp_sklearn, prin_comp_evd[:, :n_components]
Loading the data set, pre-processing and running the experiment:
dataset = datasets.load_iris()
X = dataset.data

mean = np.mean(X, axis=0)
# this was missing in your implementation
std = np.std(X, axis=0)
X_std = (X - mean) / std

for n in [3, 2, 1]:
    pca_comparison(X_std, n, dataset.target)
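As a side note (not part of the original answer): standardizing the features is what turns this into PCA on the correlation matrix rather than the covariance matrix, which connects to the quote further below. A minimal check, assuming X is the Iris feature matrix loaded above; it standardizes with ddof=1 so the equality is exact, and the principal directions are unaffected by that scaling choice:

# Sketch: with ddof=1 standardization, the covariance matrix of the
# standardized data coincides with the correlation matrix of the raw features,
# so standardizing before PCA is equivalent to doing PCA on correlations.
X_check = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
assert np.allclose(np.cov(X_check.T), np.corrcoef(X.T))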
Results
The 3D plot is a bit cluttered, but if you look at the 2D and 1D cases, you'll notice the plots are the same if we multiply the first principal component by -1; the scikit-learn PCA implementation uses Singular Value Decomposition under the hood, which gives non-unique solutions (see here).
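To make the sign ambiguity concrete, here is a small sketch of my own (not from the original answer), assuming X_std from the preprocessing step above. It computes the principal axes both via np.linalg.svd of the data matrix, which is essentially the route sklearn takes internally, and via np.linalg.eigh of the covariance matrix, and checks that they agree only up to sign:

# Sketch: principal axes of the standardized Iris data via SVD vs. EVD.
U, S, Vt = np.linalg.svd(X_std - X_std.mean(axis=0), full_matrices=False)
axes_svd = Vt                                 # rows are principal axes

e_values, e_vectors = np.linalg.eigh(np.cov(X_std.T))
order = np.flip(e_values.argsort())
axes_evd = e_vectors[:, order].T              # rows are principal axes

# The two sets of axes match only up to a per-component sign flip
assert np.allclose(np.abs(axes_svd), np.abs(axes_evd))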
Test: using the flip_signs() function from here,
def flip_signs(A, B):
    """
    utility function for resolving the sign ambiguity in SVD
    http://stats.stackexchange.com/q/34396/115202
    """
    signs = np.sign(A) * np.sign(B)
    return A, B * signs

for n in [3, 2, 1]:
    sklearn_pca, evd = pca_comparison(X_std, n, dataset.target)
    assert np.allclose(*flip_signs(sklearn_pca, evd))
Issues in your implementation:

1. You are not standardizing the data, so you are effectively doing PCA on covariances rather than on correlations. Quoting a part of the above answer (continued by @ttnphns):

   "When would one prefer to do PCA (or factor analysis or other similar type of analysis) on correlations (i.e. on z-standardized variables) instead of doing it on covariances (i.e. on centered variables)?
   1. When the variables are different units of measurement. That's clear
   ..."
2. When using e_values, e_vectors = np.linalg.eigh(X_cov), the eigenvectors are returned as the columns of e_vectors, so you should extract columns (see the documentation). You were extracting rows. A sketch with both fixes applied follows below.
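For reference, a minimal sketch (mine, not the asker's code verbatim) of how the question's fit_transform could look with both fixes applied, i.e. standardizing the input and taking eigenvector columns:

def fit_transform(self, X):
    # Standardize each feature (centering + scaling): PCA on correlations
    Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    # Eigendecomposition of the covariance matrix of the standardized data
    e_values, e_vectors = np.linalg.eigh(np.cov(Z.T))
    # Sort eigenpairs by decreasing eigenvalue; eigenvectors are the COLUMNS
    order = np.flip(e_values.argsort())
    e_vectors = e_vectors[:, order]
    # Keep the first n_components columns as principal axes and project
    return Z @ e_vectors[:, :self.n_components]

As discussed above, the output will still match sklearn's only up to per-component signs.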