Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find all-zero columns in pandas sparse matrix

For example I have a coo_matrix A :

import scipy.sparse as sp
A = sp.coo_matrix([3,0,3,0],
                  [0,0,2,0],
                  [2,5,1,0],
                  [0,0,0,0])

How can I get result [0,0,0,1], which indicates that first 3 columns contain non-zero values, only the 4th column is all zeros.

PS : cannot convert A to other type.
PS2 : I tried using np.nonzeros but it seems that my implementation is not very elegant.

like image 496
wilbeibi Avatar asked Sep 26 '16 20:09

wilbeibi


3 Answers

Approach #1 We could do something like this -

# Get the columns indices of the input sparse matrix
C = sp.find(A)[1]

# Use np.in1d to create a mask of non-zero columns. 
# So, we invert it and convert to int dtype for desired output.
out = (~np.in1d(np.arange(A.shape[1]),C)).astype(int)

Alternatively, to make the code shorter, we can use subtraction -

out = 1-np.in1d(np.arange(A.shape[1]),C)

Step-by-step run -

1) Input array and sparse matrix from it :

In [137]: arr             # Regular dense array
Out[137]: 
array([[3, 0, 3, 0],
       [0, 0, 2, 0],
       [2, 5, 1, 0],
       [0, 0, 0, 0]])

In [138]: A = sp.coo_matrix(arr) # Convert to sparse matrix as input here on

2) Get non-zero column indices :

In [139]: C = sp.find(A)[1]

In [140]: C
Out[140]: array([0, 2, 2, 0, 1, 2], dtype=int32)

3) Use np.in1d to get mask of non-zero columns :

In [141]: np.in1d(np.arange(A.shape[1]),C)
Out[141]: array([ True,  True,  True, False], dtype=bool)

4) Invert it :

In [142]: ~np.in1d(np.arange(A.shape[1]),C)
Out[142]: array([False, False, False,  True], dtype=bool)

5) Finally convert to int dtype :

In [143]: (~np.in1d(np.arange(A.shape[1]),C)).astype(int)
Out[143]: array([0, 0, 0, 1])

Alternative subtraction approach :

In [145]: 1-np.in1d(np.arange(A.shape[1]),C)
Out[145]: array([0, 0, 0, 1])

Approach #2 Here's another way and possibly a faster one using matrix-multiplication -

out = 1-np.ones(A.shape[0],dtype=bool)*A.astype(bool)

Runtime test

Let's test out all the posted approaches on a big and really sparse matrix -

In [29]: A = sp.coo_matrix((np.random.rand(4000,4000)>0.998).astype(int))

In [30]: %timeit 1-np.in1d(np.arange(A.shape[1]),sp.find(A)[1])
100 loops, best of 3: 4.12 ms per loop # Approach1

In [31]: %timeit 1-np.ones(A.shape[0],dtype=bool)*A.astype(bool)
1000 loops, best of 3: 771 µs per loop # Approach2

In [32]: %timeit 1 - (A.col==np.arange(A.shape[1])[:,None]).any(axis=1)
1 loops, best of 3: 236 ms per loop # @hpaulj's soln

In [33]: %timeit (A!=0).sum(axis=0)==0
1000 loops, best of 3: 1.03 ms per loop  # @jez's soln

In [34]: %timeit (np.sum(np.absolute(A.toarray()), 0) == 0) * 1
10 loops, best of 3: 86.4 ms per loop  # @wwii's soln 
like image 173
Divakar Avatar answered Nov 05 '22 14:11

Divakar


The actual logical operation can be performed like this:

b = (A!=0).sum(axis=0)==0
# matrix([[False, False, False,  True]], dtype=bool)

Now, to ensure that I'm answering your question exactly, I'd better tell you how you could convert from booleans to integers (although really, for most applications I can think of, you can do a lot more in numpy and friends if you stick with an array of bools):

b = b.astype(int)
#matrix([[0, 0, 0, 1]])

Either way, to then convert from a matrix to a list, you could do this:

c = list(b.flat)
# [0, 0, 0, 1]

...although again, I'm not sure this is the best thing to do: for most applications I can imagine, I would perhaps just convert to a one-dimensional numpy.array with c = b.A.flatten() instead.

like image 22
jez Avatar answered Nov 05 '22 14:11

jez


Recent

scipy.sparse.coo_matrix how to fast find all zeros column, fill with 1 and normalize

similar, except it wants to fill those columns with 1s and normalize them.

I immediately suggested the lil format of the transpose. All-0 columns will be empty lists in this format. But sticking with the coo format I suggested

np.nonzero(~(Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1))[0]

or for this 1/0 format

1 - (Mo.col==np.arange(Mo.shape[1])[:,None]).any(axis=1)

which is functionally the same as:

1 - np.in1d(np.arange(Mo.shape[1]),Mo.col)

sparse.find converts the matrix to csr to sum duplicates and eliminate duplicates, and then back to coo to get the data, row, and col attributes (which it returns).

Mo.nonzero uses A.data != 0 to eliminate 0s before returning the col and row attributes.

The np.ones(A.shape[0],dtype=bool)*A.astype(bool) solution requires converting A to csr format for multiplication.

(A!=0).sum(axis=0) also converts to csr because column (or row) sum is done with a matrix multiplication.

So the no-convert requirement is unrealistic, at least within the bounds of sparse formats.

===============

For Divakar's test case my == version is quite slow; it's ok with small ones, but creates too large of test array with the 1000 columns.

Testing on a matrix that is sparse enough to have a number of 0 columns:

In [183]: Arr=sparse.random(1000,1000,.001)
In [184]: (1-np.in1d(np.arange(Arr.shape[1]),Arr.col)).any()
Out[184]: True
In [185]: (1-np.in1d(np.arange(Arr.shape[1]),Arr.col)).sum()
Out[185]: 367

In [186]: timeit 1-np.ones(Arr.shape[0],dtype=bool)*Arr.astype(bool)
1000 loops, best of 3: 334 µs per loop
In [187]: timeit 1-np.in1d(np.arange(Arr.shape[1]),Arr.col)
1000 loops, best of 3: 323 µs per loop
In [188]: timeit 1-(Arr.col==np.arange(Arr.shape[1])[:,None]).any(axis=1)
100 loops, best of 3: 3.9 ms per loop
In [189]: timeit (Arr!=0).sum(axis=0)==0
1000 loops, best of 3: 820 µs per loop
like image 1
hpaulj Avatar answered Nov 05 '22 16:11

hpaulj