I'm looking for an NMF implementation that has a Python interface and handles both missing data and zeros.
I don't want to impute my missing values before starting the factorization; I want them to be ignored in the minimized function.
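In other words, writing V for the data, M for a binary mask that is 1 on observed entries and 0 on missing ones, and ⊙ for the elementwise product (my notation, not from any particular library), the objective I have in mind is roughly:

\min_{W, H \ge 0} \; \left\| M \odot (V - WH) \right\|_F^2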
It seems that neither scikit-learn, nor nimfa, nor graphlab, nor mahout offers such an option.
Thanks!
Nonnegative matrix factorization (NMF) has become a widely used tool for the analysis of high-dimensional data, as it automatically extracts sparse and meaningful features from a set of nonnegative data vectors.
In the NMF factorization, the parameter k (noted r in most of the literature) is the rank of the approximation of V and is chosen such that k < min(m, n). The choice of this parameter determines the representation of your data V in a basis composed of the columns of W, the w_i, i = 1, 2, …, k.
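As a quick illustration of the role of k (a minimal sketch with made-up dimensions; scikit-learn's NMF is used here only to show the shapes, since it does not support the masking asked about above):

import numpy as np
from sklearn.decomposition import NMF

V = np.random.rand(6, 4)    # nonnegative data matrix, m=6, n=4
k = 2                       # rank of the approximation, k < min(m, n)
model = NMF(n_components=k, init='random', random_state=0)
W = model.fit_transform(V)  # shape (6, 2): the basis columns w_i
H = model.components_       # shape (2, 4): coefficients
print(W.shape, H.shape)     # (6, 2) (2, 4)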
Matrix factorization is a simple embedding model. Given the feedback matrix A ∈ ℝ^{m×n}, where m is the number of users (or queries) and n is the number of items, the model learns: a user embedding matrix U ∈ ℝ^{m×d}, where row i is the embedding for user i, and an item embedding matrix V ∈ ℝ^{n×d}, where row j is the embedding for item j, such that the product U Vᵀ approximates A.
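A minimal numpy sketch of that setup (the dimensions and names here are illustrative, not from any library):

import numpy as np

m, n, d = 5, 7, 3         # users, items, embedding dimension
U = np.random.rand(m, d)  # user embeddings: row i is the embedding of user i
V = np.random.rand(n, d)  # item embeddings: row j is the embedding of item j
A_hat = U @ V.T           # predicted feedback: entry (i, j) is the dot product of U[i] and V[j]
print(A_hat.shape)        # (5, 7)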
Using this MATLAB-to-Python code conversion sheet, I was able to rewrite NMF from the MATLAB toolbox library.
I had to decompose a 40k × 1k matrix with a sparsity of 0.7%. Using 500 latent features, my machine took 20 minutes for 100 iterations.
Here is the method:
import numpy as np
from scipy import linalg
from numpy import dot


def nmf(X, latent_features, max_iter=100, error_limit=1e-6, fit_error_limit=1e-6):
    """
    Decompose X to A*Y
    """
    eps = 1e-5
    print('Starting NMF decomposition with {} latent features and {} iterations.'.format(latent_features, max_iter))
    X = X.toarray()  # I am passing in a scipy sparse matrix

    # mask: 1 where X is nonzero (observed), 0 where it is zero (treated as missing)
    mask = np.sign(X)

    # initial matrices. A is random [0,1] and Y is A\X.
    rows, columns = X.shape
    A = np.random.rand(rows, latent_features)
    A = np.maximum(A, eps)

    Y = linalg.lstsq(A, X)[0]
    Y = np.maximum(Y, eps)

    masked_X = mask * X
    X_est_prev = dot(A, Y)
    for i in range(1, max_iter + 1):
        # ===== updates =====
        # Matlab: A=A.*(((W.*X)*Y')./((W.*(A*Y))*Y'));
        top = dot(masked_X, Y.T)
        bottom = dot(mask * dot(A, Y), Y.T) + eps
        A *= top / bottom
        A = np.maximum(A, eps)
        # print('A', np.round(A, 2))

        # Matlab: Y=Y.*((A'*(W.*X))./(A'*(W.*(A*Y))));
        top = dot(A.T, masked_X)
        bottom = dot(A.T, mask * dot(A, Y)) + eps
        Y *= top / bottom
        Y = np.maximum(Y, eps)
        # print('Y', np.round(Y, 2))

        # ==== evaluation ====
        if i % 5 == 0 or i == 1 or i == max_iter:
            print('Iteration {}:'.format(i), end=' ')
            X_est = dot(A, Y)
            err = mask * (X_est_prev - X_est)
            fit_residual = np.sqrt(np.sum(err ** 2))
            X_est_prev = X_est

            curRes = linalg.norm(mask * (X - X_est), ord='fro')
            print('fit residual', np.round(fit_residual, 4), end=' ')
            print('total residual', np.round(curRes, 4))
            if curRes < error_limit or fit_residual < fit_error_limit:
                break

    return A, Y
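For example, it can be called like this (a small sketch with synthetic data; the shape, density, and number of latent features are arbitrary):

import scipy.sparse as sparse

# Random sparse nonnegative matrix; unobserved entries are the implicit zeros.
X = sparse.random(40, 10, density=0.3, format='csr')
A, Y = nmf(X, latent_features=5, max_iter=50)
X_est = A @ Y  # dense low-rank reconstruction of the observed entries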
Here I was using a SciPy sparse matrix as input, so the missing values were converted to 0 by the toarray() method, and the mask was therefore created using the numpy.sign() function. Note that this treats explicit zeros in X as missing too. If your missing values are marked as nan instead, you can build an equivalent mask using the numpy.isnan() function, which additionally lets you keep true zeros as observed entries.
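A minimal sketch of that nan-based variant (the example data is mine; you would use this mask and X in place of the np.sign-based ones inside the function):

import numpy as np

# Hypothetical dense input where missing entries are marked with nan
X = np.array([[1.0, np.nan, 3.0],
              [0.0, 2.0, np.nan]])

mask = (~np.isnan(X)).astype(float)  # 1 = observed (including true zeros), 0 = missing
X = np.nan_to_num(X)                 # replace nan with 0 so the masked products stay finite
masked_X = mask * X                  # plays the same role as mask * X above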