How to compare predictive power of PCA and NMF

I would like to compare the output of an algorithm on data preprocessed in two different ways: with NMF and with PCA. To get a somewhat comparable result, instead of simply choosing the same number of components for both PCA and NMF, I would like to pick the number that explains e.g. 95% of the retained variance.

I was wondering if it's possible to identify the variance retained in each component of NMF.

For instance, using PCA this would be given by: retainedVariance(i) = eigenvalue(i) / sum(eigenvalues)
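For reference, here is how that PCA computation looks in scikit-learn (a minimal illustrative sketch, not from the original question; explained_variance_ holds the per-component eigenvalues):

import numpy as np
from sklearn import datasets, decomposition

# illustrative sketch: retained variance per PCA component
X, _ = datasets.load_iris(return_X_y=True)
pca = decomposition.PCA().fit(X)
eigenvalues = pca.explained_variance_      # eigenvalue(i) for each component
print(eigenvalues / eigenvalues.sum())     # retainedVariance(i)
print(pca.explained_variance_ratio_)       # sklearn computes the same ratio directly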

Any ideas?

asked Jan 08 '18 by Phil D

People also ask

What is the difference between NMF and PCA?

In the classic example of face images, NMF splits a face into a number of features that one could interpret as "nose", "eyes", etc., which you can combine to recreate the original image. PCA instead gives you "generic" faces ordered by how well they capture the original one.

Is PCA matrix factorization?

In a sense, PCA is a kind of matrix factorization, since it decomposes a matrix X into WΣVᵀ. However, matrix factorization is a very general term.
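As a quick illustration of that connection (a sketch, not from the original page), PCA's scores and components fall out of the SVD of the centered data:

import numpy as np

# illustrative sketch: PCA as the factorization Xc ≈ (UΣ)Vᵀ of the centered data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)                    # PCA operates on centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                             # the compressed representation
print(np.allclose(scores @ Vt, Xc))        # True: the factorization reconstructs Xc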

Is NMF dimensionality reduction?

Nonnegative matrix factorization (NMF) is a powerful linear technique for dimension reduction. It reduces the dimensionality of data, making learning algorithms faster and more effective. Although NMF and its applications have been developed for more than a decade, they still have limitations in modeling and performance.

What is NMF algorithm?

Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements.
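A minimal sketch of that factorization with scikit-learn's NMF (illustrative only; the matrix names follow the definition above):

import numpy as np
from sklearn.decomposition import NMF

# illustrative sketch: V ≈ dot(W, H) with all three matrices non-negative
V = np.abs(np.random.default_rng(0).normal(size=(20, 10)))
model = NMF(n_components=3, random_state=0, max_iter=500)
W = model.fit_transform(V)                 # shape (20, 3), non-negative
H = model.components_                      # shape (3, 10), non-negative
print(W.min() >= 0, H.min() >= 0)          # True True
print(np.linalg.norm(V - W @ H))           # reconstruction error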


1 Answer

TL;DR

You should loop over different values of n_components and estimate the explained_variance_score of the decoded X at each iteration. This will show you how many components you need to explain 95% of the variance.

Now I will explain why.

Relationship between PCA and NMF

NMF and PCA, like many other unsupervised learning algorithms, aim to do two things:

  • encode input X into a compressed representation H;
  • decode H back to X', which should be as close to X as possible.

They do this in somewhat similar ways:

  • Decoding is similar in PCA and NMF: they output X' = dot(H, W), where W is a learned matrix parameter.
  • Encoding is different. In PCA, it is also linear: H = dot(X, V), where V is also a learned parameter. In NMF, H = argmin(loss(X, H, W)) (with respect to H only), where loss is the mean squared error between X and dot(H, W), plus some additional penalties. Minimization is performed by coordinate descent, and the result may be nonlinear in X (see the sketch after this list).
  • Training is also different. PCA learns sequentially: the first component minimizes MSE without constraints, and each subsequent kth component minimizes the residual MSE subject to being orthogonal to the previous components. NMF minimizes the same loss(X, H, W) as when encoding, but now with respect to both H and W.
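Here is a rough sketch of that encoding difference with scikit-learn (illustrative, not part of the original answer; the Iris data is non-negative, so NMF accepts it directly):

import numpy as np
from sklearn import datasets, decomposition

X, _ = datasets.load_iris(return_X_y=True)
pca = decomposition.PCA(n_components=2).fit(X)
nmf = decomposition.NMF(n_components=2, max_iter=500).fit(X)

# PCA encoding is an exact linear map: H = dot(X - mean, V)
H_pca = (X - pca.mean_) @ pca.components_.T
print(np.allclose(H_pca, pca.transform(X)))   # True

# NMF encoding solves argmin over H of loss(X, H, W) with W fixed,
# so there is no closed-form linear map behind transform()
H_nmf = nmf.transform(X)
print(H_nmf.min() >= 0)                       # True: H is constrained to be non-negative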

How to measure performance of dimensionality reduction

If you want to measure performance of an encoding/decoding algorithm, you can follow the usual steps:

  1. Train your encoder+decoder on X_train
  2. To measure in-sample performance, compare X_train'=decode(encode(X_train)) with X_train using your preferred metric (e.g. MAE, RMSE, or explained variance)
  3. To measure out-of-sample performance (generalizing ability) of your algorithm, do step 2 with the unseen X_test.

Let's try it with PCA and NMF!

from sklearn import decomposition, datasets, model_selection, preprocessing, metrics
# use the well-known Iris dataset
X, _ = datasets.load_iris(return_X_y=True)
# split the dataset, to measure overfitting
X_train, X_test = model_selection.train_test_split(X, test_size=0.5, random_state=1)
# I scale the data in order to give equal importance to all its dimensions
# NMF does not allow negative input, so I don't center the data
scaler = preprocessing.StandardScaler(with_mean=False).fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
# train both decomposers
pca = decomposition.PCA(n_components=2).fit(X_train_sc)
nmf = decomposition.NMF(n_components=2).fit(X_train_sc)
print(sum(pca.explained_variance_ratio_))

This prints the explained variance ratio 0.9536930834362043 - the default metric of PCA, estimated using its eigenvalues. We can measure the same thing more directly - by applying a metric to the actual and "predicted" values:

def get_score(model, data, scorer=metrics.explained_variance_score):
    """ Estimate performance of the model on the data """
    prediction = model.inverse_transform(model.transform(data))
    return scorer(data, prediction)

print('train set performance')
print(get_score(pca, X_train_sc))
print(get_score(nmf, X_train_sc))

print('test set performance')
print(get_score(pca, X_test_sc))
print(get_score(nmf, X_test_sc))

which gives

train set performance
0.9536930834362043 # same as before!
0.937291711378812 
test set performance
0.9597828443047842
0.9590555069007827

You can see that on the training set PCA performs better than NMF, but on the test set their performance is almost identical. This happens because NMF applies a lot of regularization:

  • H and W (the learned parameters) must be non-negative
  • H should be as small as possible (L1 and L2 penalties)
  • W should be as small as possible (L1 and L2 penalties)

These regularizations make NMF fit the training data less closely than it otherwise could, but they can improve its generalizing ability, which is what happened in our case.
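If you want to control those penalties explicitly, scikit-learn exposes them as constructor arguments (a sketch assuming scikit-learn >= 1.0, where the parameters are named alpha_W, alpha_H and l1_ratio; older versions used a single alpha, and note that the default strengths are zero, so only the non-negativity constraint is active unless you set them):

from sklearn import decomposition

# illustrative: NMF with explicit L1/L2 regularization (scikit-learn >= 1.0 names)
nmf_reg = decomposition.NMF(
    n_components=2,
    alpha_W=0.1,       # penalty strength on W
    alpha_H='same',    # reuse the same strength for H
    l1_ratio=0.5,      # 0 = pure L2, 1 = pure L1
    max_iter=500,
).fit(X_train_sc)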

How to choose the number of components

In PCA, it is simple, because its components h_1, h_2, ... h_k are learned sequentially. If you add a new component h_(k+1), the first k will not change. Thus, you can estimate the performance of each component, and these estimates will not depend on the number of components. This makes it possible for PCA to output the explained_variance_ratio_ array after only a single fit to the data.
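So for PCA, a single fit is enough to find how many components hit a 95% target - for example (a minimal sketch, reusing X_train_sc from above):

import numpy as np
from sklearn import decomposition

pca_full = decomposition.PCA().fit(X_train_sc)           # fit once with all components
cumvar = np.cumsum(pca_full.explained_variance_ratio_)   # cumulative explained variance
n_95 = int(np.searchsorted(cumvar, 0.95)) + 1            # first k reaching 95%
print(n_95)

scikit-learn can also do this selection for you: passing a float, as in PCA(n_components=0.95), keeps just enough components to explain 95% of the variance.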

NMF is more complex, because all its components are trained at the same time, and each one depends on all the rest. Thus, if you add the (k+1)-th component, the first k components will change, and you cannot match each particular component with its explained variance (or any other metric).

But what you can do is fit a new instance of NMF for each number of components and compare the total explained variance:

ks = [1,2,3,4]
perfs_train = []
perfs_test = []
for k in ks:
    nmf = decomposition.NMF(n_components=k).fit(X_train_sc)
    perfs_train.append(get_score(nmf, X_train_sc))
    perfs_test.append(get_score(nmf, X_test_sc))
print(perfs_train)
print(perfs_test)

which would give

[0.3236945680665101, 0.937291711378812, 0.995459457205891, 0.9974027602663655]
[0.26186701106012833, 0.9590555069007827, 0.9941424954209546, 0.9968456603914185]

Thus, three components (judging by train set performance) or two components (judging by test set performance) are required to explain at least 95% of the variance. Note that this case is unusual, caused by the small size of the training and test data: usually performance degrades a little on the test set, but here it actually improved slightly.

answered Sep 28 '22 by David Dale