Possible Duplicate:
MATLAB is running out of memory but it should not be
I want to perform PCA analysis on a huge data set of points. To be more specific, I have size(dataPoints) = [329150 132]
where 328150
is the number of data points and 132
are the number of features.
I want to extract the eigenvectors and their corresponding eigenvalues so that I can perform PCA reconstruction.
However, when I am using the princomp
function (i.e. [eigenVectors projectedData eigenValues] = princomp(dataPoints);
I obtain the following error :
>> [eigenVectors projectedData eigenValues] = princomp(pointsData);
Error using svd
Out of memory. Type HELP MEMORY for your options.
Error in princomp (line 86)
[U,sigma,coeff] = svd(x0,econFlag); % put in 1/sqrt(n-1) later
However, if I am using a smaller data set, I have no problem.
How can I perform PCA on my whole dataset in Matlab? Have someone encountered this problem?
Edit:
I have modified the princomp
function and tried to use svds
instead of svd
, but however, I am obtaining pretty much the same error. I have dropped the error bellow :
Error using horzcat
Out of memory. Type HELP MEMORY for your options.
Error in svds (line 65)
B = [sparse(m,m) A; A' sparse(n,n)];
Error in princomp (line 86)
[U,sigma,coeff] = svds(x0,econFlag); % put in 1/sqrt(n-1) later
Principal component analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
coeff = pca( X ) returns the principal component coefficients, also known as loadings, for the n-by-p data matrix X . Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is p-by-p.
Principal component analysis, or PCA, is a statistical procedure that allows you to summarize the information content in large data tables by means of a smaller set of “summary indices” that can be more easily visualized and analyzed.
Solution based on Eigen Decomposition
You can first compute PCA on X'X
as @david said. Specifically, see the script below:
sz = [329150 132];
X = rand(sz);
[V D] = eig(X.' * X);
Actually, V
holds the right singular vectors, and it holds the principal vectors if you put your data vectors in rows. The eigenvalues, D
, are the variances among each direction. The singular vectors, which are the standard deviations, are computed as the square root of the variances:
S = sqrt(D);
Then, the left singular vectors, U
, are computed using the formula X = USV'
. Note that U
refers to the principal components if your data vectors are in columns.
U = X*V*S^(-1);
Let us reconstruct the original data matrix and see the L2 reconstruction error:
X2 = U*S*V';
L2ReconstructionError = norm(X(:)-X2(:))
It is almost zero:
L2ReconstructionError =
6.5143e-012
If your data vectors are in columns and you want to convert your data into eigenspace coefficients, you should do U.'*X
.
This code snippet takes around 3 seconds in my moderate 64-bit desktop.
Solution based on Randomized PCA
Alternatively, you can use a faster approximate method which is based on randomized PCA. Please see my answer in Cross Validated. You can directly compute fsvd
and get U
and V
instead of using eig
.
You may employ randomized PCA if the data size is too big. But, I think the previous way is sufficient for the size you gave.
My guess is that you have a huge data set. You don't need all of the svd coefficients. In this case, use svds
instead of svd
:
Taken directly from Matlab help:
s = svds(A,k) computes the k largest singular values and associated singular vectors of matrix A.
From your question, I understand that you don't call svd
directly. But you might as well take a look at princomp
(It is editable!) and alter the line that calls it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With