I'm trying to do a simple principal component analysis with matplotlib.mlab.PCA, but with the attributes of the class I can't get a clean solution to my problem. Here's an example:
Get some dummy data in 2D and start PCA:
from matplotlib.mlab import PCA
import numpy as np
N = 1000
xTrue = np.linspace(0,1000,N)
yTrue = 3*xTrue
xData = xTrue + np.random.normal(0, 100, N)
yData = yTrue + np.random.normal(0, 100, N)
xData = np.reshape(xData, (N, 1))
yData = np.reshape(yData, (N, 1))
data = np.hstack((xData, yData))
test2PCA = PCA(data)
Now, I just want to get the principal components as vectors in my original coordinates and plot them as arrows onto my data.
What is a quick and clean way to get there?
Thanks, Tyrax
Principal Component Analysis (PCA) is an unsupervised learning technique used for dimensionality reduction in machine learning. It is a statistical procedure that converts observations of correlated features into a set of linearly uncorrelated features by means of an orthogonal transformation.
Plotting PCA: when talking about plotting a PCA, we generally mean a scatterplot of the first two principal components, PC1 and PC2. Such plots reveal features of the data such as non-linearity and departure from normality. PC1 and PC2 are evaluated for each sample vector and plotted.
A PCA plot shows clusters of samples based on their similarity. PCA does not discard any samples or characteristics (variables); instead, it reduces the overwhelming number of dimensions by constructing principal components (PCs).
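For a concrete picture of such a PC1-vs-PC2 scatterplot, here is a minimal sketch using only numpy and matplotlib (the toy data and all variable names are my own, purely for illustration):
import numpy as np
import matplotlib.pyplot as plt

# Toy data: 200 samples with 4 correlated features, generated only for illustration.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
samples = latent @ rng.normal(size=(2, 4)) + rng.normal(scale=0.1, size=(200, 4))

# Center the data, then take the first two principal axes from the SVD.
centered = samples - samples.mean(axis=0)
_, _, vh = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vh[:2].T  # PC1 and PC2 coordinates of every sample

plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()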
I don't think the mlab.PCA class is appropriate for what you want to do. In particular, the PCA class rescales the data before finding the eigenvectors:
a = self.center(a)
U, s, Vh = np.linalg.svd(a, full_matrices=False)
The center method divides by sigma:
def center(self, x):
    'center the data using the mean and sigma from training set a'
    return (x - self.mu)/self.sigma
This results in eigenvectors, pca.Wt, like this:
[[-0.70710678 -0.70710678]
 [-0.70710678  0.70710678]]
They are perpendicular, but not directly relevant to the principal axes of your original data. They are principal axes with respect to massaged data.
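A quick way to see why the standardized data always gives these ±0.707 vectors in 2D (this is my own check, not mlab code): dividing each column by its standard deviation makes both variances equal to 1, so the principal axes of the standardized data lie on the diagonals no matter what the slope of the original data is.
import numpy as np

N = 1000
t = np.linspace(0, 1000, N)
data = np.column_stack([t + np.random.normal(0, 100, N),
                        3 * t + np.random.normal(0, 100, N)])  # true slope is 3

standardized = (data - data.mean(axis=0)) / data.std(axis=0)
_, _, wt = np.linalg.svd(standardized, full_matrices=False)
print(wt)  # rows are approximately [±0.707, ±0.707], regardless of the slope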
Perhaps it might be easier to code what you want directly (without using the mlab.PCA class):
import numpy as np
import matplotlib.pyplot as plt
N = 1000
xTrue = np.linspace(0, 1000, N)
yTrue = 3 * xTrue
xData = xTrue + np.random.normal(0, 100, N)
yData = yTrue + np.random.normal(0, 100, N)
xData = np.reshape(xData, (N, 1))
yData = np.reshape(yData, (N, 1))
data = np.hstack((xData, yData))
mu = data.mean(axis=0)
data = data - mu
# data = (data - mu)/data.std(axis=0) # Uncommenting this reproduces mlab.PCA results
eigenvectors, eigenvalues, V = np.linalg.svd(data.T, full_matrices=False)
projected_data = np.dot(data, eigenvectors)
sigma = projected_data.std(axis=0).mean()
print(eigenvectors)
fig, ax = plt.subplots()
ax.scatter(xData, yData)
for axis in eigenvectors.T:  # the columns of U are the principal axes
    start, end = mu, mu + sigma * axis
    ax.annotate(
        '', xy=end, xycoords='data',
        xytext=start, textcoords='data',
        arrowprops=dict(facecolor='red', width=2.0))
ax.set_aspect('equal')
plt.show()
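As a quick sanity check on this approach (my own addition, condensing the setup above), the dominant principal axis should point along the true line y = 3x, i.e. at an angle of about arctan(3) ≈ 71.6 degrees:
import numpy as np

N = 1000
t = np.linspace(0, 1000, N)
data = np.column_stack([t + np.random.normal(0, 100, N),
                        3 * t + np.random.normal(0, 100, N)])
data = data - data.mean(axis=0)

U, s, Vh = np.linalg.svd(data.T, full_matrices=False)
first_axis = U[:, 0]  # column of U with the largest singular value
angle = np.degrees(np.arctan2(first_axis[1], first_axis[0])) % 180
print(angle, np.degrees(np.arctan(3)))  # both close to ~71.6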
Note that matplotlib.mlab.PCA was removed in Matplotlib 3.1.
Below are three alternative PCA implementations: one based on the last matplotlib.mlab.PCA implementation, one based on unutbu's answer, and one based on doug's answer to another question.
The first two use singular value decomposition (svd) to obtain the eigenvalues and eigenvectors; the latter uses a covariance matrix (cov) approach.
A magnificent explanation of the relation between the svd and cov approaches is found here.
The implementations have been simplified and refactored for easy comparison.
import numpy
import scipy.linalg

def pca_svd(data):
    """ based on matplotlib.mlab.PCA with standardize=False """
    data -= data.mean(axis=0)
    __, singular_values, eigenvectors_transposed = numpy.linalg.svd(
        data, full_matrices=False)
    eigenvalues = singular_values ** 2 / (data.shape[0] - 1)
    eigenvectors = eigenvectors_transposed.T
    transformed_data = numpy.dot(data, eigenvectors)
    return transformed_data, eigenvalues, eigenvectors

def pca_svd_transposed(data):
    """ based on unutbu's answer """
    data -= data.mean(axis=0)
    eigenvectors, singular_values, __ = numpy.linalg.svd(
        data.T, full_matrices=False)  # note data transposed
    eigenvalues = singular_values ** 2 / (data.shape[0] - 1)
    transformed_data = numpy.dot(data, eigenvectors)
    return transformed_data, eigenvalues, eigenvectors

def pca_cov(data):
    """ based on doug's answer """
    data -= data.mean(axis=0)
    covariance_matrix = numpy.cov(data, rowvar=False)
    eigenvalues, eigenvectors = scipy.linalg.eigh(covariance_matrix)
    decreasing_order = numpy.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[decreasing_order]
    eigenvectors = eigenvectors[:, decreasing_order]
    transformed_data = numpy.dot(data, eigenvectors)
    return transformed_data, eigenvalues, eigenvectors
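As a quick consistency check (a sketch of my own, assuming the three functions above are defined), they should return the same eigenvalues and, up to the sign of each axis, the same eigenvectors:
import numpy

data = numpy.random.standard_normal((15000, 4))

# The functions center the input in place, so pass each one its own copy.
t1, val1, vec1 = pca_svd(data.copy())
t2, val2, vec2 = pca_svd_transposed(data.copy())
t3, val3, vec3 = pca_cov(data.copy())

assert numpy.allclose(val1, val2) and numpy.allclose(val1, val3)
# Each principal axis is only determined up to sign, so compare absolute values.
assert numpy.allclose(numpy.abs(vec1), numpy.abs(vec2))
assert numpy.allclose(numpy.abs(vec1), numpy.abs(vec3))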
The eigenvalues represent the variance of the data along the principal axes, i.e. the variance of the transformed_data.
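This can be verified directly (again my own check, using pca_svd from above):
import numpy

data = numpy.random.standard_normal((1000, 3))
transformed_data, eigenvalues, eigenvectors = pca_svd(data)

# The per-axis variance of the transformed data equals the eigenvalues.
print(numpy.allclose(transformed_data.var(axis=0, ddof=1), eigenvalues))  # True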
Timing using timeit reveals the following on my system:
array shape: (15000, 4)
iterations: 1000
pca_svd_transposed: 4.32 s (average 4.32 ms)
pca_svd: 1.87 s (average 1.87 ms)
pca_cov: 1.41 s (average 1.41 ms)
Note that the svd of the transposed input array is relatively slow for this array shape.
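The exact benchmark code is not shown above; a minimal timeit harness along the following lines (shape and iteration count taken from the figures above, everything else my own) should give comparable numbers, apart from the small overhead of copying the array in each call:
import timeit

import numpy

data = numpy.random.standard_normal((15000, 4))
iterations = 1000

for func in (pca_svd_transposed, pca_svd, pca_cov):
    # Copy inside the timed call so the in-place centering does not accumulate.
    total = timeit.timeit(lambda: func(data.copy()), number=iterations)
    print(f'{func.__name__}: {total:.2f} s '
          f'(average {1000 * total / iterations:.2f} ms)')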