I wanted to implement PCA with a class similar to the one in sklearn.
My algorithm for finding PCA with k principal components is as follows:
import numpy as np

class MyPCA:
    def __init__(self, n_components):
        self.n_components = n_components

    def fit_transform(self, X):
        """
        Assumes observations in X are passed as rows of a numpy array.
        """
        # Translate the dataset so it's centered around 0
        translated_X = X - np.mean(X, axis=0)
        # Calculate the eigenvalues and eigenvectors of the covariance matrix
        e_values, e_vectors = np.linalg.eigh(np.cov(translated_X.T))
        # Sort eigenvalues and their eigenvectors in descending order
        e_ind_order = np.flip(e_values.argsort())
        e_values = e_values[e_ind_order]
        e_vectors = e_vectors[e_ind_order]
        # Save the first n_components eigenvectors as principal components
        principal_components = np.take(e_vectors, np.arange(self.n_components), axis=0)
        return np.matmul(translated_X, principal_components.T)
However, when run on the Iris dataset, this implementation produces vastly different results from sklearn's, and the results do not show that there are three distinct groups within the data:
from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_pca_results(pca_class, dataset, plot_title):
    X = dataset.data
    y = dataset.target
    y_names = dataset.target_names

    pca = pca_class(n_components=1)
    B = pca.fit_transform(X)
    B = np.concatenate([B, np.zeros_like(B)], 1)

    scatter = plt.scatter(B[:, 0], B[:, 1], c=y)
    scatter_objects, _ = scatter.legend_elements()
    plt.title(plot_title)
    plt.legend(scatter_objects, y_names, loc="lower left", title="Classes")
    plt.show()

dataset = datasets.load_iris()
plot_pca_results(MyPCA, dataset, "Iris - my PCA")
plot_pca_results(PCA, dataset, "Iris - Sklearn")
What might be the cause of such differences? Where does my approach or my calculation go wrong?
Comparing the two methods
The issue is with not standardizing the data and with how the eigenvectors (principal axes) are extracted. The following function compares the two methods.
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def pca_comparison(X, n_components, labels):
    """X: Standardized dataset, observations on rows
       n_components: dimensionality of the reduced space
       labels: targets, for visualization
    """

    # numpy
    # -----
    # calculate eigenvalues and eigenvectors of the covariance matrix
    X_cov = np.cov(X.T)
    e_values, e_vectors = np.linalg.eigh(X_cov)

    # Sort eigenvalues and their eigenvectors in descending order
    e_ind_order = np.flip(e_values.argsort())
    e_values = e_values[e_ind_order]
    e_vectors = e_vectors[:, e_ind_order]  # note that we have to re-order the columns, not rows

    # now we can project the dataset onto the eigenvectors (principal axes)
    prin_comp_evd = X @ e_vectors

    # sklearn
    # -------
    pca = PCA(n_components=n_components)
    prin_comp_sklearn = pca.fit_transform(X)

    # plotting
    if n_components == 3:
        fig = plt.figure(figsize=(10, 5))
        ax = fig.add_subplot(121, projection='3d')
        ax.scatter(prin_comp_sklearn[:, 0],
                   prin_comp_sklearn[:, 1],
                   prin_comp_sklearn[:, 2],
                   c=labels)
        ax.set_title("sklearn plot")

        ax = fig.add_subplot(122, projection='3d')
        ax.scatter(prin_comp_evd[:, 0],
                   prin_comp_evd[:, 1],
                   prin_comp_evd[:, 2],
                   c=labels)
        ax.set_title("PCA using EVD plot")

        fig.suptitle(f"Plots for reducing to {n_components}-D")
        plt.show()

    elif n_components == 2:
        fig, ax = plt.subplots(1, 2, figsize=(10, 5))
        ax[0].scatter(prin_comp_sklearn[:, 0],
                      prin_comp_sklearn[:, 1],
                      c=labels)
        ax[0].set_title("sklearn plot")

        ax[1].scatter(prin_comp_evd[:, 0],
                      prin_comp_evd[:, 1],
                      c=labels)
        ax[1].set_title("PCA using EVD plot")

        fig.suptitle(f"Plots for reducing to {n_components}-D")
        plt.show()

    elif n_components == 1:
        fig, ax = plt.subplots(1, 2, figsize=(10, 5))
        ax[0].scatter(prin_comp_sklearn[:, 0],
                      np.zeros_like(prin_comp_sklearn[:, 0]),
                      c=labels)
        ax[0].set_title("sklearn plot")

        ax[1].scatter(prin_comp_evd[:, 0],
                      np.zeros_like(prin_comp_evd[:, 0]),
                      c=labels)
        ax[1].set_title("PCA using EVD plot")

        fig.suptitle(f"Plots for reducing to {n_components}-D")
        plt.show()

    return prin_comp_sklearn, prin_comp_evd[:, :n_components]
Loading the data set, pre-processing and running the experiment:
dataset = datasets.load_iris()
X = dataset.data

mean = np.mean(X, axis=0)
# this was missing in your implementation
std = np.std(X, axis=0)
X_std = (X - mean) / std

for n in [3, 2, 1]:
    pca_comparison(X_std, n, dataset.target)
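As a side note (not part of the original answer): standardizing the features is what turns this into PCA on the correlation matrix rather than the covariance matrix, which connects to the quote further below. A minimal check, assuming X is the Iris feature matrix loaded above; it standardizes with ddof=1 so the equality is exact, and the principal directions are unaffected by that scaling choice:

# Sketch: with ddof=1 standardization, the covariance matrix of the
# standardized data coincides with the correlation matrix of the raw features,
# so standardizing before PCA is equivalent to doing PCA on correlations.
X_check = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
assert np.allclose(np.cov(X_check.T), np.corrcoef(X.T))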
Results
The 3D plot is a bit cluttered, but if you look at the 2D and 1D cases, you'll notice the plots are the same if we multiply the first principal component by -1; the scikit-learn PCA implementation uses Singular Value Decomposition under the hood, which gives non-unique solutions (see here).
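To make the sign ambiguity concrete, here is a small sketch of my own (not from the original answer), assuming X_std from the preprocessing step above. It computes the principal axes both via np.linalg.svd of the data matrix, which is essentially the route sklearn takes internally, and via np.linalg.eigh of the covariance matrix, and checks that they agree only up to sign:

# Sketch: principal axes of the standardized Iris data via SVD vs. EVD.
U, S, Vt = np.linalg.svd(X_std - X_std.mean(axis=0), full_matrices=False)
axes_svd = Vt                                 # rows are principal axes

e_values, e_vectors = np.linalg.eigh(np.cov(X_std.T))
order = np.flip(e_values.argsort())
axes_evd = e_vectors[:, order].T              # rows are principal axes

# The two sets of axes match only up to a per-component sign flip
assert np.allclose(np.abs(axes_svd), np.abs(axes_evd))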
Test: using the flip_signs() function from here,
def flip_signs(A, B):
    """
    utility function for resolving the sign ambiguity in SVD
    http://stats.stackexchange.com/q/34396/115202
    """
    signs = np.sign(A) * np.sign(B)
    return A, B * signs

for n in [3, 2, 1]:
    sklearn_pca, evd = pca_comparison(X_std, n, dataset.target)
    assert np.allclose(*flip_signs(sklearn_pca, evd))
Issues in your implementation:

1. You are not standardizing the data, so you are effectively doing PCA on covariances rather than on correlations. Quoting a part of the above answer (continued by @ttnphns):

   "When would one prefer to do PCA (or factor analysis or other similar type of analysis) on correlations (i.e. on z-standardized variables) instead of doing it on covariances (i.e. on centered variables)?
   1. When the variables are different units of measurement. That's clear
   ..."
2. When using e_values, e_vectors = np.linalg.eigh(X_cov), the eigenvectors are returned as the columns of e_vectors, so you should extract columns (see the documentation). You were extracting rows. A sketch with both fixes applied follows below.
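For reference, a minimal sketch (mine, not the asker's code verbatim) of how the question's fit_transform could look with both fixes applied, i.e. standardizing the input and taking eigenvector columns:

def fit_transform(self, X):
    # Standardize each feature (centering + scaling): PCA on correlations
    Z = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    # Eigendecomposition of the covariance matrix of the standardized data
    e_values, e_vectors = np.linalg.eigh(np.cov(Z.T))
    # Sort eigenpairs by decreasing eigenvalue; eigenvectors are the COLUMNS
    order = np.flip(e_values.argsort())
    e_vectors = e_vectors[:, order]
    # Keep the first n_components columns as principal axes and project
    return Z @ e_vectors[:, :self.n_components]

As discussed above, the output will still match sklearn's only up to per-component signs.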