
Fast non-negative matrix factorization on large sparse matrix

I'm using scikit-learn (v0.15.2) for non-negative matrix factorization on a large sparse matrix (less than 1% of the values are non-zero). I want to find factors by minimizing the error only on the non-zero values of the matrix (i.e., not computing the error for entries that are zero), while favoring sparsity. I'm not sure if something's wrong with what I'm trying. The scikit-learn package's NMF and ProjectedGradientNMF have worked well for me before, but it seems that as the matrix size increases, the factorization becomes terribly slow.

I'm talking about matrices with > 10^10 cells. For a matrix with ~10^7 cells, I find the execution time to be acceptable.

The parameters I've used are as follows:

nmf_model = NMF(n_components=100, init='nndsvd', random_state=0, tol=0.01, sparseness='data')

When I tried slightly different parameters (changing to init='random'), I got the following warning, after which the execution of the script halts:

/lib/python2.7/site-packages/sklearn/decomposition/nmf.py:252: UserWarning: Iteration limit reached in nls subproblem.
  warnings.warn("Iteration limit reached in nls subproblem.")

Is there a way to make this faster and solve the above problem? I've tried using scipy sparse matrices (both column- and row-sparse formats), but surprisingly it was slower on the test I did with a smaller matrix (~10^7 cells).
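
For reference, a minimal sketch of the sparse-input attempt (the real matrix is far larger; sparseness='data' from 0.15.x is omitted here since later scikit-learn releases removed that parameter):

import numpy as np
from scipy.sparse import rand as sparse_rand
from sklearn.decomposition import NMF

# Stand-in for the real data: a random CSR matrix with ~1% non-zero entries
X = sparse_rand(10000, 1000, density=0.01, format='csr', random_state=0)

nmf_model = NMF(n_components=100, init='nndsvd', random_state=0, tol=0.01)
W = nmf_model.fit_transform(X)  # X is approximated by W @ nmf_model.components_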

Considering that one would have to run many such factorizations (to choose an ideal number of factors and to do k-fold cross-validation), a faster way to solve this problem is highly desirable.
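
To make that model-selection cost concrete, here is a hypothetical sweep over the number of factors, reusing X from the sketch above (the candidate values are illustrative, not tuned):

from sklearn.decomposition import NMF

# Each candidate k means one full factorization of the big matrix
for k in (20, 50, 100, 200):
    model = NMF(n_components=k, init='nndsvd', random_state=0, tol=0.01)
    model.fit(X)
    print(k, model.reconstruction_err_)  # Frobenius error of X ~ W @ H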

I'm also open to suggestions of package/tools that's not based on sklearn or Pyhon. I understand questions about package/tool choices are not encouraged, but for such a specific use-case, knowing what techniques others in the field use would be very helpful.

asked Jul 22 '15 by vpk



2 Answers

Maybe a few words on what the initial problem is about could enable us to give better answers.

Matrix Factorization on a very large matrix is always going to be slow due to the nature of the problem.

Suggestions: Reducing n_components to < 20 will speed it up somewhat, but the only real improvement will come from limiting the size of the matrix. With a matrix like the one you describe, my guess is that you are trying to factorize a term-frequency matrix. If so, you can use the vectorization classes in scikit-learn to cap the size of the matrix; most of them have a max_features parameter. Example:

from sklearn.feature_extraction.text import TfidfVectorizer

# max_features caps the vocabulary, so the matrix has at most 10,000 columns
vectorizer = TfidfVectorizer(max_features=10000,
                             ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(data)

This will significantly speed up the factorization.

Should I be completely wrong and this is not a term-frequency problem, I would still look into ways to limit the size of the initial matrix you are trying to factorize.
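
For completeness, a sketch of feeding the capped matrix into NMF with a reduced n_components, per the suggestion above (the values are illustrative, not tuned):

from sklearn.decomposition import NMF

# Fewer components and a smaller matrix both cut the per-iteration cost
nmf = NMF(n_components=15, init='nndsvd', random_state=0)
W = nmf.fit_transform(tfidf)  # document-topic weights
H = nmf.components_           # topic-term weights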

answered Oct 18 '22 by wonderkid2


You might want to take a look at this article, which discusses more recent NMF techniques: http://www.cc.gatech.edu/~hpark/papers/nmf_blockpivot.pdf

The idea is to work only on the non-zero entries during factorization, which reduces computation time, especially when the matrices involved are very sparse.

Also, one of the authors of that article has published NMF implementations on GitHub, including the ones described in the article. Here's the link: https://github.com/kimjingu/nonnegfac-python
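
If the interface in that repo's README still holds (an assumption worth checking against the current code), usage looks roughly like this:

from nonnegfac.nmf import NMF_ANLS_BLOCKPIVOT

# A: a non-negative scipy sparse (or numpy) matrix; 100: target rank.
# The block-pivot ANLS solver is the method from the linked article.
W, H, info = NMF_ANLS_BLOCKPIVOT().run(A, 100)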

Hope that helps.

answered Oct 18 '22 by jtitusj