Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

indices of sparse_csc matrix are reversed after extracting some columns

I'm trying to extract columns of a scipy sparse column matrix, but the result is not stored as I'd expect. Here's what I mean:

In [77]: a = scipy.sparse.csc_matrix(np.ones([4, 5]))
In [78]: ind = np.array([True, True, False, False, False])
In [79]: b = a[:, ind]

In [80]: b.indices
Out[80]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)

In [81]: a.indices
Out[81]: array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)

How come b.indices is not [0, 1, 2, 3, 0, 1, 2, 3] ? And since this behaviour is not the one I expect, is a[:, ind] not the correct way to extract columns from a csc matrix?

like image 895
P. Camilleri Avatar asked Sep 02 '25 02:09

P. Camilleri


2 Answers

The indices are not sorted. You can either force the looping by reversing in a's rows, which is not that intuitive, or enforce sorted indices (you can also do it in-place, but I prefer casting). What I find funny is that the has_sorted_indices attribute does not always return a boolean, but mixes it with integer representation.

a = scipy.sparse.csc_matrix(np.ones([4, 5]))
ind = np.array([True, True, False, False, False])
b = a[::-1, ind]
b2 = a[:, ind]
b3 = b2.sorted_indices()

b.indices
>>array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b.has_sorted_indices
>>1
b2.indices
>>array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
b2.has_sorted_indices
>>0
b3.indices
array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b3.has_sorted_indices
>>True
like image 188
Uvar Avatar answered Sep 05 '25 07:09

Uvar


csc and csr indices are not guaranteed to be sorted. I can't off hand find documentation to the effect, but the has_sort_indices and the sort methods suggest that.

In your case the order is the result of how the indexing is done. I found in previous SO questions, that multicolumn indexing is performed with a matrix multiplication:

In [165]: a = sparse.csc_matrix(np.ones([4,5]))
In [166]: b = a[:,[0,1]]
In [167]: b.indices
Out[167]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)

This indexing is the equivalent to constructing a 'selection' matrix:

In [169]: I = sparse.csr_matrix(np.array([[1,0,0,0,0],[0,1,0,0,0]]).T)
In [171]: I.A
Out[171]: 
array([[1, 0],
       [0, 1],
       [0, 0],
       [0, 0],
       [0, 0]], dtype=int32)

and doing this matrix multiplication:

In [172]: b1 = a * I
In [173]: b1.indices
Out[173]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)

The order is the result of how the matrix multiplication was done. In fact a * a.T does the same reversal. We'd have to examine the multiplication code to know exactly why. Evidently the csc and csr calculation code doesn't require sorted indices, and doesn't bother to ensure the results are sorted.

https://docs.scipy.org/doc/scipy-0.19.1/reference/sparse.html#further-details

Further Details¶ CSR column indices are not necessarily sorted. Likewise for CSC row indices. Use the .sorted_indices() and .sort_indices() methods when sorted indices are required (e.g. when passing data to other libraries).

like image 41
hpaulj Avatar answered Sep 05 '25 08:09

hpaulj