I am trying to filter values smaller than 10 from a huge (1Mx1M) CSR matrix (SciPy). Since all my values are integers, dividing by 10 and remultiplying by 10 does the job, but I was wondering if there isn't a better way to go about filtering elements.
EDIT: The answer below works. Check that you have the latest version of SciPy.
You can also go with the less hacky, but probably slower:
m = m.multiply(m >= 10)
To understand what's going on:
>>> m = scipy.sparse.csr_matrix((1000, 1000), dtype=int)
>>> m[np.random.randint(0, 1000, 20),
...   np.random.randint(0, 1000, 20)] = np.random.randint(0, 100, 20)
>>> m.data
array([92, 46, 99, 24, 75, 16, 49, 60, 87, 64, 91, 37, 30, 32, 25, 40, 99,
        9,  3, 84])
>>> m >= 10
<1000x1000 sparse matrix of type '<type 'numpy.bool_'>'
with 18 stored elements in Compressed Sparse Row format>
>>> m = m.multiply(m >= 10)
>>> m
<1000x1000 sparse matrix of type '<type 'numpy.int32'>'
with 18 stored elements in Compressed Sparse Row format>
>>> m.data
array([92, 46, 99, 24, 75, 16, 49, 60, 87, 64, 91, 37, 30, 32, 25, 40, 99,
       84])
I think the version issue has to do with the implementation of the comparison operators: m >= 10 calls m.__ge__. (I don't have an earlier version of scipy to test this, but I believe there are one or more SO threads on the topic.)
Something that might work in earlier versions is:
m.data *= m.data>=10
m.eliminate_zeros()
In other words, use a standard numpy operation on m.data to set the selected values to 0 (the test could be a lot more complicated), and then use a standard sparse function to clean it up. When you say 'filter', that's essentially what you want to do, isn't it: set some values to zero and remove them from the sparse matrix?
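Here is a minimal, self-contained sketch of that idea on a tiny hand-built matrix. The helper name drop_below and the sample values are made up for illustration, and it works on a copy so the original matrix is left untouched; on a 1Mx1M matrix you may prefer to modify m.data in place to avoid the copy.

```python
import numpy as np
import scipy.sparse as sp


def drop_below(m, threshold):
    """Return a copy of sparse matrix m without stored values below threshold.

    Hypothetical helper for illustration: zero the unwanted entries in the
    underlying data array, then let SciPy drop the explicit zeros.
    """
    out = m.copy()                       # keep the caller's matrix unchanged
    out.data[out.data < threshold] = 0   # plain NumPy masking on stored values
    out.eliminate_zeros()                # remove the zeroed entries from storage
    return out


# Small example: five stored values, two of them below 10.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 3, 1, 0, 2])
vals = np.array([92, 9, 3, 46, 10])
m = sp.csr_matrix((vals, (rows, cols)), shape=(3, 4))

filtered = drop_below(m, 10)
print(m.nnz, filtered.nnz)
print(filtered.toarray())
```

Only the stored values are touched, so the cost scales with the number of nonzeros rather than with the full matrix shape.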