I have a very large dataset (a NumPy array) that I run a PCA on for dimensionality reduction. The dataset is called train_data. I use scikit-learn and do it like this:
from sklearn.decomposition import PCA

pca = PCA(n_components=1000, svd_solver='randomized')
pca.fit(train_data)
smaller_data = pca.transform(train_data)
I have a second dataset called test_data that I want to apply the same transformation to, like this:
smaller_test = pca.transform(test_data)
However, between these two steps I need to save the model to disk.
According to the scikit-learn documentation, I can do this with pickle:

import pickle

with open("pca.p", "wb") as f:
    pickle.dump(pca, f)

but this pickle file is way too large for my limited disk space.
The reduced dataset smaller_data is of acceptable size to be saved as a .npy file:

import numpy as np

np.save('train_data_pca.npy', smaller_data)
How can I do a transform(test_data) without the full pickle, or otherwise make the saved PCA pickle smaller? Zipping with the gzip package is not enough; I tried that.
I found a way; it is actually pretty easy after looking into the source code of the transform method in scikit-learn.
I have to save the fitted mean (note the attribute is pca.mean_, not pca.means_), together with pca.components_:

means = pca.mean_  # put this into a .npy file, along with pca.components_
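A minimal sketch of saving the two arrays that transform needs; the file names pca_mean.npy and pca_components.npy are my own choice:

import numpy as np

# Per-feature mean used for centering, shape (n_features,)
np.save('pca_mean.npy', pca.mean_)
# Projection matrix, shape (n_components, n_features)
np.save('pca_components.npy', pca.components_)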
and then it is just centering followed by a matrix multiplication:

td = test_data - means
tdd = np.dot(td, pca.components_.T)

which yields the same result as pca.transform(test_data). (scikit-learn used to expose a fast_dot helper in sklearn.utils.extmath for this, but it has since been removed; plain np.dot does the same thing.)
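For completeness, here is a sketch of the full round trip, assuming the hypothetical file names used above; the commented-out assert is just a sanity check against the in-memory estimator:

import numpy as np

# Load the saved PCA parameters instead of unpickling the whole model.
means = np.load('pca_mean.npy')
components = np.load('pca_components.npy')

# transform() with the default whiten=False is: center, then project.
smaller_test = np.dot(test_data - means, components.T)

# Optional sanity check if the fitted estimator is still in memory:
# assert np.allclose(smaller_test, pca.transform(test_data))

Note that this only matches pca.transform with the default whiten=False; with whiten=True, transform additionally scales each component by the inverse square root of its explained variance, so you would need to save pca.explained_variance_ as well.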