
Principal Component Analysis with Eigen Library

Tags: c++, pca, eigen

I'm trying to compute the 2 major principal components from a dataset in C++ with Eigen.

The way I do it at the moment is to normalize the data to the range [0, 1] and then subtract the mean of each feature. After that I compute the covariance matrix and run an eigenvalue decomposition on it. I know SVD is faster, but I'm confused about the computed components.

Here is the relevant code (where traindata is my MxN input matrix, with one datapoint per row):

Eigen::RowVectorXf normalize(Eigen::RowVectorXf vec) {
  for (int i = 0; i < vec.size(); i++) { // Normalize each feature to [0, 1].
      vec[i] = (vec[i] - minCoeffs[i]) / scalingFactors[i];
  }
  return vec;
}

// Calculate normalization coefficients (globals of type Eigen::RowVectorXf).
maxCoeffs = traindata.colwise().maxCoeff();
minCoeffs = traindata.colwise().minCoeff();
scalingFactors = maxCoeffs - minCoeffs;

// Normalize each datapoint (row).
for (int i = 0; i < traindata.rows(); i++) {
  traindata.row(i) = normalize(traindata.row(i));
}

// Mean centering data.
Eigen::RowVectorXf featureMeans = traindata.colwise().mean();
Eigen::MatrixXf centered = traindata.rowwise() - featureMeans;

// Compute the covariance matrix.
Eigen::MatrixXf cov = centered.adjoint() * centered;
cov = cov / (traindata.rows() - 1);

Eigen::SelfAdjointEigenSolver<Eigen::MatrixXf> eig(cov);
// Normalize eigenvalues to make them represent percentages.
Eigen::VectorXf normalizedEigenValues =  eig.eigenvalues() / eig.eigenvalues().sum();


// Get the two major eigenvectors and omit the others. SelfAdjointEigenSolver
// sorts the eigenvalues in increasing order, so the major components are the
// last (rightmost) columns.
Eigen::MatrixXf evecs = eig.eigenvectors();
Eigen::MatrixXf pcaTransform = evecs.rightCols(2);


// Map the dataset in the new two dimensional space.
traindata = traindata * pcaTransform;

The result of this code is something like this:

[plot of the training data projected onto the two major components]

To confirm my results, I tried the same in WEKA: I applied the Normalize filter and then the Center filter, in that order, followed by the PrincipalComponents filter, and then saved and plotted the output. The result is this:

[plot of the PCA output produced by WEKA]

Technically I should have done the same, however the outcome is so different. Can anyone see if I made a mistake?

Chris asked Nov 04 '15

People also ask

What is eigenvalue in principal component analysis?

Eigenvalues represent the amount of variance that is explained by a given principal component. Because a covariance matrix is positive semi-definite, its eigenvalues are always non-negative; an eigenvalue of zero means the corresponding direction carries no variance at all.

Are principal components eigenvalues?

The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the “core” of a PCA: the eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude, i.e. the variance along each direction.

What are the principal components of eigen vector?

The first principal component is the first eigenvector. The second eigenvector is perpendicular (orthogonal) to the first. The two eigenvectors are orthogonal because the covariance matrix is symmetric, and the eigenvectors of a symmetric matrix can always be chosen mutually orthogonal; together they span the whole feature space.
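As a quick numerical illustration (the 2x2 covariance matrix below is made up for the example), Eigen's SelfAdjointEigenSolver returns orthonormal eigenvectors, so V^T * V comes out as the identity:

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::Matrix2f cov;           // a symmetric, covariance-like matrix
  cov << 2.0f, 0.8f,
         0.8f, 1.0f;
  Eigen::SelfAdjointEigenSolver<Eigen::Matrix2f> eig(cov);
  Eigen::Matrix2f evecs = eig.eigenvectors();
  // Eigenvectors of a symmetric matrix are orthonormal, so this prints
  // (numerically) the 2x2 identity matrix.
  std::cout << evecs.transpose() * evecs << std::endl;
}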

Does PCA use Eigen Value decomposition?

Principal component analysis (PCA) is usually explained via an eigen-decomposition of the covariance matrix. However, it can also be performed via singular value decomposition (SVD) of the data matrix X.
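Since the question mentions SVD, here is a minimal sketch (not from the original post) of how the same projection could be obtained with Eigen's JacobiSVD, assuming centered is the mean-centered data matrix with one sample per row:

#include <Eigen/Dense>

// PCA via SVD of the centered data matrix: centered = U * S * V^T.
// The columns of V are the principal directions, and the covariance
// eigenvalues relate to the singular values via lambda_i = sigma_i^2 / (n - 1).
Eigen::MatrixXf pcaViaSvd(const Eigen::MatrixXf& centered, int numComponents) {
  Eigen::JacobiSVD<Eigen::MatrixXf> svd(centered, Eigen::ComputeThinV);
  // Singular values are sorted in decreasing order, so the major components
  // are the leftmost columns of V (whereas SelfAdjointEigenSolver sorts
  // eigenvalues in increasing order, hence rightCols(2) in the question).
  Eigen::MatrixXf components = svd.matrixV().leftCols(numComponents);
  return centered * components; // dataset mapped onto the top components
}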


2 Answers

When scaling to [0, 1], make sure the result of normalize is actually written back into traindata; modifying only the local variable vec has no effect.

Moreover, this can be done more easily this way:

RowVectorXf maxCoeffs = traindata.colwise().maxCoeff();
RowVectorXf minCoeffs = traindata.colwise().minCoeff();
RowVectorXf scalingFactors = maxCoeffs - minCoeffs;
traindata = (traindata.rowwise()-minCoeffs).array().rowwise() / scalingFactors.array();

that is, using row-vectors and array features.
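One caveat worth noting: if a feature is constant, its range is zero and the division above produces NaNs. A minimal guard, assuming the same variable names and <limits> included:

// Keep each scaling factor away from zero; a constant feature then maps
// to all zeros, since its numerator is zero anyway.
scalingFactors = scalingFactors.cwiseMax(std::numeric_limits<float>::epsilon());
traindata = (traindata.rowwise() - minCoeffs).array().rowwise() / scalingFactors.array();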

Let me also add that the symmetric eigenvalue decomposition is actually faster than SVD. The true advantage of SVD in this case is that it avoids squaring the entries, but since your input data are normalized and centered, and you only care about the largest eigenvalues, there is no accuracy concern here.

ggael answered Sep 26 '22


The reason was that Weka standardized the dataset, which means it scales each feature to unit variance. When I did this as well, the plots looked the same. So technically my approach was correct too.
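For reference, a rough sketch of that standardization step in Eigen (this is my reading of what Weka's Standardize filter does, not code from the original post):

// Center each feature, then divide by its sample standard deviation.
Eigen::RowVectorXf featureMeans = traindata.colwise().mean();
Eigen::MatrixXf centered = traindata.rowwise() - featureMeans;
Eigen::RowVectorXf stddev =
    (centered.array().square().colwise().sum() / (traindata.rows() - 1)).sqrt();
traindata = centered.array().rowwise() / stddev.array();

As with the [0, 1] scaling above, a constant feature would need a guard against division by zero.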

Chris answered Sep 25 '22