I have two scipy_sparse_csr_matrix 'a' and scipy_sparse_csr_matrix(boolean) 'mask', and I want to set elements of 'a' to zero where element of mask is True.
for example
>>>a
<3x3 sparse matrix of type '<type 'numpy.int32'>'
with 4 stored elements in Compressed Sparse Row format>
>>>a.todense()
matrix([[0, 0, 3],
[0, 1, 5],
[7, 0, 0]])
>>>mask
<3x3 sparse matrix of type '<type 'numpy.bool_'>'
with 4 stored elements in Compressed Sparse Row format>
>>>mask.todense()
matrix([[ True, False, True],
[False, False, True],
[False, True, False]], dtype=bool)
Then I want to obtain the following result.
>>>result
<3x3 sparse matrix of type '<type 'numpy.int32'>'
with 2 stored elements in Compressed Sparse Row format>
>>>result.todense()
matrix([[0, 0, 0],
[0, 1, 0],
[7, 0, 0]])
I can do it by operation like
result = a - a.multiply(mask)
or
a -= a.multiply(mask) #I don't care either in-place or copy.
But I think above operations are inefficient. Since actual shape of 'a' and 'mask' are 67,108,864 × 2,000,000, these operations take several seconds on high spec server(64 core Xeon cpu, 512GB memory). For example, 'a' has about 30,000,000 non-zero elements, and 'mask' has about 1,800,000 non-zero(True) elements, then above operation take about 2 seconds.
Is there more efficient way to do this?
Conditions are below.
Thanks!
Other way(tried)
a.data*=~np.array(mask[a.astype(np.bool)]).flatten();a.eliminate_zeros() #This takes twice the time longer than above method.
My initial impression is that this multiply and subtract approach is a reasonable one. Quite often sparse
code implements operations as some sort of multiplication, even if the dense equivalents use more direct methods. The sparse sum over rows or columns uses a matrix multiplication with the appropriate row or column matrix of 1s. Even row or column indexing uses matrix multiplication (at least on the csr
format).
Sometimes we can improve on operations by working directly with the matrix attributes (data
, indices
, indptr
). But that requires a lot more thought and experimentation.
For the dense arrays my first try would be
In [611]: a.A*~(mask.A)
Out[611]:
array([[0, 0, 0],
[0, 1, 0],
[7, 0, 0]], dtype=int32)
But there isn't a direct way of doing not
to a sparse matrix. If mask
was indeed sparse, ~mask
would not be. In your example mask
has 4 True terms, and 5 False, so a dense version would work just as well:
In [612]: nmask=sparse.csr_matrix(~(mask.A))
In [615]: a.multiply(nmask)
Out[615]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in Compressed Sparse Row format>
CSR scipy matrix does not update after updating its values explores setting the diagonal of a sparse matrix to 0. It is possible to set values of the data
attribute to 0, and then eliminate_zeros
once at the end.
The other dense method is
In [618]: a1=a.A
In [619]: a1[mask.A]=0
This also works in sparse
- sort of
In [622]: a2=a.copy()
In [624]: a2[mask]
Out[624]: matrix([[0, 3, 5, 0]], dtype=int32)
In [625]: a2[mask]=0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
SparseEfficiencyWarning)
In [626]: a2
Out[626]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 6 stored elements in Compressed Sparse Row format>
As noted in the previous question, we need to eliminate the zeros:
In [628]: a2.eliminate_zeros()
In [629]: a2
Out[629]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in Compressed Sparse Row format>
Taking a hint from the sparsity warning let's try the lil
format
In [638]: al=a.tolil()
In [639]: al[mask]
Out[639]:
<1x4 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in LInked List format>
In [640]: al[mask]=0
In [641]: al
Out[641]:
<3x3 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in LInked List format>
It's interesting that al[mask]
is still sparse, where as a[mask]
is dense. The 2 formats use different indexing methods.
At some low level of sparsity, it might be worth iterating over the True (nonzero) elements of mask
, setting the corresponding terms of a
to zero directly.
I'm not going to guess as to the relative speeds of these methods. That needs to be tested on realistic data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With