I would like to compare the output of an algorithm with differently preprocessed data: NMF and PCA. In order to get a somewhat comparable result, instead of simply choosing the same number of components for both PCA and NMF, I would like to pick the number that explains e.g. 95% of retained variance.
I was wondering if it's possible to identify the variance retained in each component of NMF.
For instance, using PCA this would be given by:
retainedVariance(i) = eigenvalue(i) / sum(eigenvalue)
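(In scikit-learn terms, this per-component ratio is what PCA already exposes as explained_variance_ratio_; a minimal sketch of what I mean, on random data:)

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 5)
pca = PCA().fit(X)
# per-component retained variance: eigenvalue(i) / sum(eigenvalues)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # sums to 1.0 when all components are kept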
Any ideas?
For example, when applied to face images, NMF splits a face into a number of features that one could interpret as "nose", "eyes" etc., which you can combine to recreate the original image. PCA instead gives you "generic" faces ordered by how well they capture the original one.
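If you want to see this yourself, here is a rough sketch along the lines of scikit-learn's faces-decomposition example (it assumes the Olivetti faces dataset can be downloaded):

from sklearn import datasets, decomposition
faces = datasets.fetch_olivetti_faces(shuffle=True, random_state=0)
X_faces = faces.data  # 400 images of 64x64 = 4096 non-negative pixel values
nmf = decomposition.NMF(n_components=6, init='nndsvd', max_iter=500).fit(X_faces)
pca = decomposition.PCA(n_components=6).fit(X_faces)
# NMF components are non-negative "parts", PCA components are signed "eigenfaces"
print(nmf.components_.min() >= 0)  # True
print(pca.components_.min() < 0)   # True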
In a sense, PCA is a kind of matrix factorization, since it decomposes a matrix X into WΣVᵀ (its singular value decomposition). However, matrix factorization is a very general term.
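A small sketch of that connection on random data (the right singular vectors match PCA's principal axes up to sign):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 5)
Xc = X - X.mean(axis=0)                              # PCA factorizes the centered matrix
W, S, Vt = np.linalg.svd(Xc, full_matrices=False)    # Xc = W @ diag(S) @ Vt
pca = PCA().fit(X)
# the rows of Vt are the principal axes, up to sign
print(np.allclose(np.abs(Vt), np.abs(pca.components_)))               # True
# the PCA eigenvalues are the squared singular values scaled by (n_samples - 1)
print(np.allclose(S**2 / (X.shape[0] - 1), pca.explained_variance_))  # True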
Non-negative matrix factorization (NMF) is a powerful linear technique for dimensionality reduction. It reduces the dimensionality of data, making learning algorithms faster and more effective. Although NMF and its applications have been developed for more than a decade, they still have limitations in modeling and performance.
Non-negative matrix factorization (NMF or NNMF), also called non-negative matrix approximation, is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements.
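A minimal sketch of that factorization with scikit-learn (random non-negative data; the shapes and solver settings are only illustrative):

import numpy as np
from sklearn.decomposition import NMF

V = np.random.RandomState(0).rand(20, 10)        # non-negative input matrix
model = NMF(n_components=3, init='nndsvd', max_iter=1000)
W = model.fit_transform(V)                       # (20, 3), non-negative
H = model.components_                            # (3, 10), non-negative
print((W >= 0).all(), (H >= 0).all())            # True True
print(np.linalg.norm(V - W @ H))                 # reconstruction error of the approximation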
You should loop over different n_components and estimate the explained_variance_score of the decoded X at each iteration. This will show you how many components you need to explain 95% of the variance.
Now I will explain why.
NMF and PCA, like many other unsupervised learning algorithms, aim to do two things:

- encode the input X into a compressed representation H;
- decode H back to X', which should be as close to X as possible.

They do it in a somewhat similar way:

- Decoding is the same in both: X' = dot(H, W), where W is a learned matrix parameter.
- Encoding differs. In PCA it is also linear: H = dot(X, V), where V is also a learned parameter. In NMF, H = argmin(loss(X, H, W)) (with respect to H only), where loss is the mean squared error between X and dot(H, W), plus some additional penalties. The minimization is performed by coordinate descent, and the result may be nonlinear in X.
- Training also differs. PCA learns its components sequentially: the k-th component minimizes the residual MSE subject to being orthogonal to the previous components. NMF minimizes the same loss(X, H, W) as when encoding, but now with respect to both H and W.
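You can check the encoding/decoding claims directly; a minimal sketch on random data (the exact numbers do not matter):

import numpy as np
from sklearn.decomposition import PCA, NMF

X = np.random.RandomState(0).rand(50, 4)

pca = PCA(n_components=2).fit(X)
H_pca = pca.transform(X)
# PCA encoding and decoding are plain matrix products (after centering)
print(np.allclose(H_pca, (X - pca.mean_) @ pca.components_.T))                         # True
print(np.allclose(pca.inverse_transform(H_pca), H_pca @ pca.components_ + pca.mean_))  # True

nmf = NMF(n_components=2, init='nndsvd', max_iter=1000).fit(X)
H_nmf = nmf.transform(X)          # obtained by iterative minimization, not a single matmul
# NMF decoding is still just dot(H, W)
print(np.allclose(nmf.inverse_transform(H_nmf), H_nmf @ nmf.components_))              # True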
If you want to measure the performance of an encoding/decoding algorithm, you can follow the usual steps:
1. Train your encoder+decoder on X_train.
2. To measure in-sample performance, compare X_train' = decode(encode(X_train)) with X_train using your preferred metric (e.g. MAE, RMSE, or explained variance).
3. To measure out-of-sample performance (generalizing ability), do the same with the unseen X_test.

Let's try it with PCA and NMF!
from sklearn import decomposition, datasets, model_selection, preprocessing, metrics
# use the well-known Iris dataset
X, _ = datasets.load_iris(return_X_y=True)
# split the dataset, to measure overfitting
X_train, X_test = model_selection.train_test_split(X, test_size=0.5, random_state=1)
# I scale the data in order to give equal importance to all its dimensions
# NMF does not allow negative input, so I don't center the data
scaler = preprocessing.StandardScaler(with_mean=False).fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
# train both decomposers
pca = decomposition.PCA(n_components=2).fit(X_train_sc)
nmf = decomposition.NMF(n_components=2).fit(X_train_sc)
print(sum(pca.explained_variance_ratio_))
This prints an explained variance ratio of 0.9536930834362043 - the default metric of PCA, estimated via its eigenvalues. We can measure the same thing in a more direct way - by applying a metric to the actual and "predicted" values:
def get_score(model, data, scorer=metrics.explained_variance_score):
""" Estimate performance of the model on the data """
prediction = model.inverse_transform(model.transform(data))
return scorer(data, prediction)
print('train set performance')
print(get_score(pca, X_train_sc))
print(get_score(nmf, X_train_sc))
print('test set performance')
print(get_score(pca, X_test_sc))
print(get_score(nmf, X_test_sc))
which gives
train set performance
0.9536930834362043 # same as before!
0.937291711378812
test set performance
0.9597828443047842
0.9590555069007827
You can see that on the training set PCA performs better than NMF, but on the test set their performance is almost identical. This happens because NMF applies lots of regularization:

- H and W (the learned parameters) must be non-negative;
- H should be as small as possible (L1 and L2 penalties);
- W should be as small as possible (L1 and L2 penalties).

These regularizations make NMF fit the training data worse than it otherwise could, but they might improve its generalizing ability, which is what happened in our case.
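If you want to play with the strength of those size penalties, recent scikit-learn versions (>= 1.0) expose them as alpha_W, alpha_H and l1_ratio (older versions use a single alpha), for example:

# heavier L1/L2 penalties on W and H (parameter names from scikit-learn >= 1.0)
nmf_reg = decomposition.NMF(n_components=2, init='nndsvd', max_iter=1000,
                            alpha_W=0.1, alpha_H=0.1, l1_ratio=0.5).fit(X_train_sc)
print(get_score(nmf_reg, X_train_sc), get_score(nmf_reg, X_test_sc))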
In PCA it is simple, because its components h_1, h_2, ... h_k are learned sequentially. If you add a new component h_(k+1), the first k will not change. Thus, you can estimate the performance of each component, and these estimates do not depend on the number of components. This makes it possible for PCA to output the explained_variance_ratio_ array after only a single fit to the data.
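So for PCA a single fit is enough to pick the number of components for a 95% target, for example:

import numpy as np
pca_full = decomposition.PCA().fit(X_train_sc)            # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)                                         # running total of retained variance
print(np.argmax(cumulative >= 0.95) + 1)                  # smallest k reaching 95%

(Passing a float, as in PCA(n_components=0.95), makes scikit-learn do this selection for you.)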
NMF is more complex, because all its components are trained at the same time, and each one depends on all the rest. Thus, if you add the (k+1)-th component, the first k components will change, and you cannot match each particular component with its explained variance (or any other metric).
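You can observe this on the iris data; the exact values depend on the solver and initialization, but the leading component generally shifts when more components are added:

nmf2 = decomposition.NMF(n_components=2, init='nndsvd', max_iter=1000).fit(X_train_sc)
nmf3 = decomposition.NMF(n_components=3, init='nndsvd', max_iter=1000).fit(X_train_sc)
# compare the "first" component of the two models - it is not a fixed object
print(nmf2.components_[0])
print(nmf3.components_[0])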
But what you can do is fit a new instance of NMF for each number of components and compare the total explained variance:
ks = [1,2,3,4]
perfs_train = []
perfs_test = []
for k in ks:
nmf = decomposition.NMF(n_components=k).fit(X_train_sc)
perfs_train.append(get_score(nmf, X_train_sc))
perfs_test.append(get_score(nmf, X_test_sc))
print(perfs_train)
print(perfs_test)
which would give
[0.3236945680665101, 0.937291711378812, 0.995459457205891, 0.9974027602663655]
[0.26186701106012833, 0.9590555069007827, 0.9941424954209546, 0.9968456603914185]
Thus, three components (judging by the train set performance) or two components (judging by the test set) are required to explain at least 95% of the variance. Please note that this case is unusual and caused by the small size of the training and test data: usually performance degrades a little on the test set, but in my case it actually improved slightly.
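If you prefer to pick the number programmatically rather than by eyeballing the lists, something like this works with the ks and perfs_test from the loop above:

import numpy as np
# assumes at least one score reaches the threshold
n_components_95 = ks[int(np.argmax(np.array(perfs_test) >= 0.95))]
print(n_components_95)   # 2 for the test-set scores above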