I am using PCA to reduce the dimensionality of an N-dimensional dataset, but I want to build in robustness to large outliers, so I've been looking into Robust PCA codes.
For traditional PCA, I'm using Python's sklearn.decomposition.PCA, which nicely returns the principal components as vectors, onto which I can then project my data (to be clear, I've also coded my own versions using SVD, so I know how the method works). I found a few pre-coded RPCA Python codes out there (like https://github.com/dganguli/robust-pca and https://github.com/jkarnows/rpcaADMM).
The 1st code is based on the Candes et al. (2009) method, and returns low-rank L and sparse S matrices for a dataset D. The 2nd code uses the ADMM method of matrix decomposition (Parikh, N., & Boyd, S. 2013) and returns X_1, X_2, X_3 matrices. I must admit, I'm having a very hard time figuring out how to connect these to the principal axes that are returned by a standard PCA algorithm. Can anyone provide any guidance?
Specifically, in one dataset X, I have a cloud of N 3-D points. I run it through PCA:
pca = sklearn.decomposition.PCA(n_components=3)
pca.fit(X)
comps = pca.components_
and these 3 components are 3-D vectors that define the new basis onto which I project all my points. With Robust PCA, I get matrices L+S=X. Does one then run pca.fit(L)? I would have thought that RPCA would give me back the eigenvectors, but with internal steps to throw out outliers as part of building the covariance matrix or performing the SVD.
Maybe what I think of as "Robust PCA" isn't how other people are using/coding it?
The robust-pca code factors the data matrix D into two matrices, L and S, which are "low-rank" and "sparse" matrices (see the paper for details). L is what's mostly constant between the various observations, while S is what varies. Figures 2 and 3 in the paper give a really nice example from a couple of security cameras, picking out the static background (L) and variability such as passing people (S).
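To make that decomposition concrete, here is a small self-contained numpy sketch (a synthetic illustration of mine, not taken from either repo) that builds a matrix with exactly this low-rank-plus-sparse structure:
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "low-rank + sparse" data, mimicking the roles RPCA assigns to L and S.
basis = rng.normal(size=(2, 3))                 # rank-2 structure in 3-D
L_true = rng.normal(size=(100, 2)) @ basis      # low-rank part (the "background")
S_true = np.zeros((100, 3))
S_true[rng.choice(100, size=5, replace=False), 0] = 10.0   # a few large outliers
D = L_true + S_true

# An RPCA solver given D should recover L ~ L_true and S ~ S_true, i.e.
# D = L + S with rank(L) small and S mostly zero.
print(np.linalg.matrix_rank(L_true), np.count_nonzero(S_true))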
If you just want the eigenvectors, treat the S as junk (the "large outliers" you're wanting to clip out) and do an eigenanalysis on the L matrix.
Here's an example using the robust-pca code:
import numpy as np

L, S = RPCA(data).fit()           # decompose the data into low-rank L and sparse S
rcomp, revals, revecs = pca(L)    # ordinary PCA on the low-rank part only
print("Normalised robust eigenvalues: %s" % (revals/np.sum(revals),))
Here, the pca function is:
def pca(data, numComponents=None):
    """Principal Components Analysis

    From: http://stackoverflow.com/a/13224592/834250

    Parameters
    ----------
    data : `numpy.ndarray`
        numpy array of data to analyse
    numComponents : `int`
        number of principal components to use

    Returns
    -------
    comps : `numpy.ndarray`
        Principal components
    evals : `numpy.ndarray`
        Eigenvalues
    evecs : `numpy.ndarray`
        Eigenvectors
    """
    m, n = data.shape
    data -= data.mean(axis=0)
    R = np.cov(data, rowvar=False)
    # use 'eigh' rather than 'eig' since R is symmetric,
    # the performance gain is substantial
    evals, evecs = np.linalg.eigh(R)
    idx = np.argsort(evals)[::-1]
    evecs = evecs[:, idx]
    evals = evals[idx]
    if numComponents is not None:
        evecs = evecs[:, :numComponents]
    # carry out the transformation on the data using eigenvectors
    # and return the re-scaled data, eigenvalues, and eigenvectors
    return np.dot(evecs.T, data.T).T, evals, evecs
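To connect this back to the question about projecting points onto a new basis: the columns of revecs play the role of sklearn's pca.components_ (transposed). A minimal sketch, assuming data and revecs from the snippets above, of expressing the original, uncleaned observations in the robust basis:
# Project the original observations onto the robust principal axes
# (the columns of revecs, i.e. the eigenvectors found from L).
centred = data - data.mean(axis=0)
robust_scores = np.dot(centred, revecs)   # shape (N, n_components)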