Sparse Efficiency Warning while changing the column

Tags: python, numpy, scipy, nlp

def tdm_modify(feature_names, tdm):
    non_useful_words = ['kill', 'stampede', 'trigger', 'cause', 'death',
                        'hospital', 'minister', 'said', 'told', 'say',
                        'injury', 'victim', 'report']
    # Column index of each unwanted term (feature_names is a list)
    indexes = [feature_names.index(word) for word in non_useful_words]
    for index in indexes:
        tdm[:, index] = 0  # the assignment that produces the warning
    return tdm

I want to manually set zero weights for some terms in a term-document matrix (tdm). With the code above I get the warning below, and I don't understand why. Is there a better way to do this?

C:\Anaconda\lib\site-packages\scipy\sparse\compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)

asked Oct 12 '15 22:10

Abhishek Bhatia

1 Answer

I ran into this warning as well while working on a machine learning problem. The exact application was constructing a document-term matrix from a corpus of text. I agree with the accepted answer, and I will add one empirical observation:

My task was to build a 25000 x 90000 matrix of uint8. My desired output was a sparse matrix in compressed sparse row format, i.e. a csr_matrix.

The fastest way to do this by far, at the cost of using quite a bit more memory in the interim, was to initialize a dense matrix using np.zeros(), build it up, then do csr_matrix(dense_matrix) once at the end.

The second fastest way was to build up a lil_matrix, then convert it to csr_matrix with the .tocsr() method. This is recommended in the accepted answer. (Thank you hpaulj).

The slowest way was to assemble the csr_matrix element by element.
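The first two approaches can be sketched on a toy matrix as follows. The sizes and the `entries` list are made up for illustration and are not the data from my task; at 25000 x 90000 the dense uint8 intermediate takes about 2.25 GB, which is the memory cost mentioned above.

```python
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix

# Hypothetical toy data: (row, column, count) triples standing in for
# term counts. The real matrix was far larger.
n_rows, n_cols = 4, 6
entries = [(0, 1, 3), (0, 4, 1), (2, 2, 7), (3, 5, 2)]

# 1. Dense first: fill a NumPy array, then convert once at the end.
dense = np.zeros((n_rows, n_cols), dtype=np.uint8)
for i, j, value in entries:
    dense[i, j] = value
csr_from_dense = csr_matrix(dense)

# 2. lil_matrix first: incremental assignment, then .tocsr().
lil = lil_matrix((n_rows, n_cols), dtype=np.uint8)
for i, j, value in entries:
    lil[i, j] = value
csr_from_lil = lil.tocsr()

# Both routes should yield the same CSR contents.
same = np.array_equal(csr_from_dense.toarray(), csr_from_lil.toarray())
```

The third approach would assign into a csr_matrix inside the same loop, which is the pattern that produces the SparseEfficiencyWarning.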

So to sum up, if you have enough working memory to build a dense matrix and only want a sparse matrix later on for downstream efficiency, it may be faster to build the matrix in dense format and then convert it once at the end. If you need to work in sparse format the whole time because of memory limitations, building the matrix as a lil_matrix and then converting it (as in the accepted answer) is faster than building a csr_matrix from the start.
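Applied to the question, one possible sketch is to do the column edits in LIL format and convert back to CSR once. The helper name `zero_columns` and the small test matrix are my own illustration, not code from the question or from the accepted answer.

```python
import numpy as np
from scipy.sparse import csr_matrix

def zero_columns(tdm, column_indexes):
    """Return a CSR copy of tdm with the given columns set to zero."""
    work = tdm.tolil()            # LIL tolerates structural changes well
    for index in column_indexes:
        work[:, index] = 0
    result = work.tocsr()         # convert back once, after all edits
    result.eliminate_zeros()      # drop any explicitly stored zeros
    return result

# Hypothetical 2 x 3 term-document matrix; column 2 is the unwanted term.
tdm = csr_matrix(np.array([[1, 0, 2],
                           [0, 3, 4]]))
cleaned = zero_columns(tdm, [2])
```

In the question's function, `column_indexes` would be the `indexes` list built from `feature_names`. Whether this beats other options for a given matrix is something to measure rather than assume.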

answered Sep 29 '22 17:09

Michael S. Emanuel
