 

Saving a PCA fitted on a large data set to disk for later use with limited disk space

I have a very large data set (a numpy array) called train_data that I run a PCA on for dimensionality reduction. Using scikit-learn, I do this like so:

pca = PCA(n_components=1000, svd_solver='randomized')
pca.fit(train_data)
smaller_data = pca.transform(train_data)

I have a second data set called test_data that I want to use the same transformations on, like this:

smaller_test = pca.transform(test_data)

However, between these two steps I need to save the model to disk.

According to the scikit-learn documentation, I can do this with pickle:

pickle.dump(pca, open( "pca.p", "wb" ) )

but this pickle file is way too large for my limited disk space.

The reduced data set smaller_data is of acceptable size to be saved as a .npy file:

np.save('train_data_pca.npy', smaller_data)

How can I use this file to do a transform(test_data), or otherwise make the saved PCA pickle smaller? Compressing it with the gzip package is not enough; I already tried that.

asked Feb 27 '17 by spore234


1 Answer

I found a way; it is actually pretty easy after looking into the source code of the transform method in scikit-learn.

I have to save the components and the mean:

means = pca.mean_   # put this into a .npy file
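
A minimal sketch of how both arrays could be written to disk as .npy files (the filenames here are just placeholders):

np.save('pca_mean.npy', pca.mean_)
np.save('pca_components.npy', pca.components_)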

and then it is just matrix multiplication:

from sklearn.utils.extmath import fast_dot
td = test_data - means                  # center with the training mean
tdd = fast_dot(td, pca.components_.T)   # project onto the principal components

yields the same as

pca.transform(test_data)
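
To make the round trip explicit, here is a minimal sketch that reloads the two arrays from the placeholder files above and reproduces the transform without unpickling the fitted model. It assumes the default whiten=False and uses np.dot rather than fast_dot, which has since been removed from scikit-learn:

import numpy as np

means = np.load('pca_mean.npy')
components = np.load('pca_components.npy')

# equivalent to pca.transform(test_data), but without the pickled model
smaller_test = np.dot(test_data - means, components.T)

np.allclose(smaller_test, pca.transform(test_data)) should return True.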
answered Nov 08 '22 by spore234