Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Boolean operations on scipy.sparse matrices

I have a set of sparse matrices filled with boolean values that I need to perform logical operations on (mostly element-wise OR).

as in numpy, summing matrices with dtype='bool' gives the element-wise OR, however there's a nasty side-effect:

>>> from scipy import sparse
>>> [a,b] = [sparse.rand(5,5,density=0.1,format='lil').astype('bool')
...  for x in range(2)]
>>> b
<5x5 sparse matrix of type '<class 'numpy.bool_'>'
    with 2 stored elements in LInked List format>
>>> a+b
<5x5 sparse matrix of type '<class 'numpy.int8'>'
    with 4 stored elements in Compressed Sparse Row format>

The data type gets changed to 'int8', which causes problems for future operations. This could be gotten around with by saying:

(a+b).astype('bool')

But I get the impression that all this type changing would cause a performance hit.

Why is the dtype of the result different from the operands?
And is there a better way to do logical operations on sparse matrices in python?

like image 737
TheONP Avatar asked Jan 24 '13 21:01

TheONP


People also ask

How does SciPy sparse matrix work?

Python's SciPy provides tools for creating sparse matrices using multiple data structures, as well as tools for converting a dense matrix to a sparse matrix. The sparse matrix representation outputs the row-column tuple where the matrix contains non-zero values along with those values.

What is SciPy sparse in Python?

Sparse data is data that has mostly unused elements (elements that don't carry any information ). It can be an array like this one: [1, 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 0] Sparse Data: is a data set where most of the item values are zero.

What does Csr_matrix do in Python?

csr_matrix((data, indices, indptr), [shape=(M, N)]) is the standard CSR representation where the column indices for row i are stored in indices[indptr[i]:indptr[i+1]] and their corresponding values are stored in data[indptr[i]:indptr[i+1]] .

How do SciPy sparse matrices multiply?

We use the multiply() method provided in both csc_matrix and csr_matrix classes to multiply two sparse matrices. We can multiply two matrices of same format( both matrices are csc or csr format) and also of different formats ( one matrix is csc and other is csr format).


2 Answers

Logical operations are not supported for sparse matrices, but converting back to a 'bool' is not all that expensive. Actually, if using LIL format matrices, the conversion may appear to take negative time due to performance fluctuations:

a = scipy.sparse.rand(10000, 10000, density=0.001, format='lil').astype('bool')
b = scipy.sparse.rand(10000, 10000, density=0.001, format='lil').astype('bool')

In [2]: %timeit a+b
10 loops, best of 3: 61.2 ms per loop

In [3]: %timeit (a+b).astype('bool')
10 loops, best of 3: 60.4 ms per loop

You may have noticed that your LIL matrices were converted to CSR format before adding them together, look at the return format. If you had already been using CSR format to begin with, then the conversion overhead becomes more noticeable:

In [14]: %timeit a+b
100 loops, best of 3: 2.28 ms per loop

In [15]: %timeit (a+b).astype(bool)
100 loops, best of 3: 2.96 ms per loop

CSR (and CSC) matrices have a data attribute which is a 1D array that holds the actual non-zero entries of the sparse matrix, so the cost of recasting your sparse matrix will depend on the number of non-zero entries of your matrix, not its size:

a = scipy.sparse.rand(10000, 10000, density=0.0005, format='csr').astype('int8')
b = scipy.sparse.rand(1000, 1000, density=0.5, format='csr').astype('int8')

In [4]: %timeit a.astype('bool') # a is 10,000x10,000 with 50,000 non-zero entries
10000 loops, best of 3: 93.3 us per loop

In [5]: %timeit b.astype('bool') # b is 1,000x1,000 with 500,000 non-zero entries
1000 loops, best of 3: 1.7 ms per loop
like image 176
Jaime Avatar answered Oct 18 '22 15:10

Jaime


You can easily express Boolean operations by the following means. Then it works with sparse matrices.

a.multiply(b) #AND
a+b           #OR
(a>b)+(a<b)   #XOR
a>b           #NOT

So Boolean operations are supported.

like image 40
Radio Controlled Avatar answered Oct 18 '22 15:10

Radio Controlled