
PCA memory error in Sklearn: Alternative Dim Reduction?

I am trying to reduce the dimensionality of a very large matrix using PCA in Sklearn, but it produces a memory error (RAM required exceeds 128GB). I have already set copy=False and I'm using the less computationally expensive randomised PCA.

Is there a workaround? If not, what other dimensionality reduction techniques could I use that require less memory? Thank you.


Update: the matrix I am trying to PCA is a set of feature vectors, obtained by passing a set of training images through a pretrained CNN. The matrix has shape [300000, 51200]. PCA components tried: 100 to 500.

I want to reduce its dimensionality so I can use these features to train an ML algo, such as XGBoost. Thank you.
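
For reference, this is roughly the call that runs out of memory (a minimal sketch; copy=False and the randomised solver are what I described above, while n_components=250 is just one of the values I tried):

from sklearn.decomposition import PCA

# train_features is the [300000, 51200] CNN feature matrix
pca = PCA(n_components=250, copy=False, svd_solver='randomized')
train_features_reduced = pca.fit_transform(train_features)  # raises MemoryError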

Asked Apr 11 '17 by Chris Parry

2 Answers

You could use IncrementalPCA, available in scikit-learn: from sklearn.decomposition import IncrementalPCA. The rest of the interface is the same as PCA. You can also pass an extra argument, batch_size, which must be greater than or equal to n_components.
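
A minimal sketch of that interface (n_components and batch_size here are illustrative; train_features is the matrix from the question):

from sklearn.decomposition import IncrementalPCA

n_comp = 250
# batch_size must be at least n_components
ipca = IncrementalPCA(n_components=n_comp, batch_size=1000)
ipca.fit(train_features)                        # fits in mini-batches of 1000 rows
train_reduced = ipca.transform(train_features)

# If the full matrix does not fit in RAM, call partial_fit on chunks
# loaded from disk (e.g. a numpy memmap) instead of fit on the whole array.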

However, if you need to apply a non-linear version such as KernelPCA, there does not seem to be support for anything similar. KernelPCA absolutely explodes in its memory requirement; see the Wikipedia article on Nonlinear Dimensionality Reduction.
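
To put a rough number on that for the matrix in the question (back-of-the-envelope, assuming float64): KernelPCA materialises a full n x n kernel matrix, so

n = 300000                        # samples in the question's matrix
bytes_per_value = 8               # float64
kernel_matrix_bytes = n * n * bytes_per_value
print(kernel_matrix_bytes / 1e9)  # ~720 GB for the kernel matrix alone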

Answered by Vivek Puurkayastha

In the end, I used TruncatedSVD instead of PCA, which is capable of handling large matrices without memory issues:

from sklearn import decomposition

# Reduce the [300000, 51200] feature matrix to 250 components.
n_comp = 250
svd = decomposition.TruncatedSVD(n_components=n_comp, algorithm='arpack')
svd.fit(train_features)
# How much of the total variance the 250 components retain.
print(svd.explained_variance_ratio_.sum())

# Project both the training and test features into the reduced space.
train_features = svd.transform(train_features)
test_features = svd.transform(test_features)
Answered by Chris Parry