def tdm_modify(feature_names, tdm):
    non_useful_words = ['kill', 'stampede', 'trigger', 'cause', 'death', 'hospital',
                        'minister', 'said', 'told', 'say', 'injury', 'victim', 'report']
    indexes = [feature_names.index(word) for word in non_useful_words]
    for index in indexes:
        tdm[:, index] = 0
    return tdm
I want to manually set the weights of some terms in the tdm matrix to zero. With the code above I get the warning below, and I don't understand why. Is there a better way to do this?
C:\Anaconda\lib\site-packages\scipy\sparse\compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In MATLAB, nnz returns the number of nonzero elements in a sparse matrix, nonzeros returns a column vector containing all of its nonzero elements, and nzmax returns the amount of storage allocated for the nonzero entries.
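SciPy's sparse matrices expose similar information under slightly different names, and the semantics differ a little from the MATLAB-style functions described above. In particular, the nnz attribute counts stored entries, including zeros that were written explicitly, while count_nonzero() counts values that are actually nonzero. A small sketch with made-up data:

```python
from scipy import sparse

m = sparse.csr_matrix([[1, 0, 2],
                       [0, 0, 3]])
stored_before = m.nnz              # number of stored entries

m.data[0] = 0                      # overwrite a stored value in place
stored_after = m.nnz               # the explicit zero is still stored
true_nonzeros = m.count_nonzero()  # counts only values that are nonzero

m.eliminate_zeros()                # remove explicit zeros from storage
stored_clean = m.nnz
rows, cols = m.nonzero()           # row and column indices of nonzero entries
```

This is relevant to the question because zeroing values in place does not shrink the matrix until eliminate_zeros() is called.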
The function csr_matrix() is used to create a sparse matrix of compressed sparse row format whereas csc_matrix() is used to create a sparse matrix of compressed sparse column format.
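To make the difference concrete, here is a small sketch (toy data, not from the original thread) that stores the same matrix in both formats and looks at the underlying arrays. In CSR the indptr array marks where each row starts; in CSC it marks where each column starts.

```python
from scipy import sparse

dense = [[1, 0, 2],
         [0, 0, 3]]

csr = sparse.csr_matrix(dense)
csc = sparse.csc_matrix(dense)

# CSR: entries grouped by row; `indices` holds column numbers.
print(csr.indptr, csr.indices, csr.data)

# CSC: entries grouped by column; `indices` holds row numbers.
print(csc.indptr, csc.indices, csc.data)
```

Because CSC groups entries by column, column-wise operations such as the one in the question tend to suit it better than CSR, though changing the sparsity structure is costly in either compressed format.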
lil_matrix((M, N), [dtype]) constructs an empty matrix with shape (M, N); dtype is optional and defaults to dtype='d'. Sparse matrices can be used in arithmetic operations: they support addition, subtraction, multiplication, division, and matrix power.
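As a brief illustration of the constructor and of sparse arithmetic, the following sketch builds a small LIL matrix, fills two entries, and then does the arithmetic in CSR format. The values are arbitrary.

```python
import numpy as np
from scipy import sparse

lil = sparse.lil_matrix((2, 3))   # empty 2 x 3 matrix, default dtype 'd'
lil[0, 1] = 5
lil[1, 2] = 2

csr = lil.tocsr()                 # convert once the structure is final
doubled = csr + csr               # element-wise addition
scaled = csr * 2                  # multiplication by a scalar
gram = csr @ csr.T                # matrix product, shape (2, 2)

print(doubled.toarray())
print(gram.toarray())
```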
I ran into this warning as well while working on a machine learning problem. The exact application was constructing a document-term matrix from a corpus of text. I agree with the accepted answer, and I will add one empirical observation:
My exact task was to build a 25000 x 90000 matrix of uint8. My desired output was a sparse matrix in compressed sparse row format, i.e. a csr_matrix.
The fastest way to do this by far, at the cost of using quite a bit more memory in the interim, was to initialize a dense matrix using np.zeros(), build it up, then do csr_matrix(dense_matrix) once at the end.
The second fastest way was to build up a lil_matrix, then convert it to csr_matrix with the .tocsr() method. This is recommended in the accepted answer. (Thank you hpaulj).
The slowest way was to assemble the csr_matrix element by element.
So to sum up: if you have enough working memory to build a dense matrix, and only want a sparse matrix later for downstream efficiency, it may be faster to build the matrix in dense format and convert it once at the end. If you need to work in sparse format the whole time because of memory limitations, building the matrix as a lil_matrix and then converting it (as in the accepted answer) is faster than building a csr_matrix from the start.
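The three strategies compared above can be sketched as follows. This is only an illustration on a tiny, hypothetical list of (row, column, value) triples; the function names are made up, and nothing here reproduces the timings from the 25000 x 90000 case.

```python
import warnings

import numpy as np
from scipy import sparse

# Hypothetical toy input: (row, column, count) triples for a 4 x 6 matrix.
entries = [(0, 1, 3), (0, 4, 1), (2, 0, 7), (3, 5, 2), (3, 1, 9)]
shape = (4, 6)


def build_via_dense(entries, shape):
    # Fill a dense array, then convert once. Needs memory for the full array.
    dense = np.zeros(shape, dtype=np.uint8)
    for r, c, v in entries:
        dense[r, c] = v
    return sparse.csr_matrix(dense)


def build_via_lil(entries, shape):
    # Fill a LIL matrix, then convert once with .tocsr().
    lil = sparse.lil_matrix(shape, dtype=np.uint8)
    for r, c, v in entries:
        lil[r, c] = v
    return lil.tocsr()


def build_csr_directly(entries, shape):
    # Element-by-element assignment into CSR: this is the pattern that
    # raises SparseEfficiencyWarning, silenced here only for the demo.
    csr = sparse.csr_matrix(shape, dtype=np.uint8)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", sparse.SparseEfficiencyWarning)
        for r, c, v in entries:
            csr[r, c] = v
    return csr


from_dense = build_via_dense(entries, shape)
from_lil = build_via_lil(entries, shape)
from_csr = build_csr_directly(entries, shape)
```

All three produce the same matrix; they differ only in how much time and interim memory they need, which is worth measuring on your own data before choosing one.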