I'm trying to extract columns of a scipy sparse column matrix, but the result is not stored as I'd expect. Here's what I mean:
In [77]: a = scipy.sparse.csc_matrix(np.ones([4, 5]))
In [78]: ind = np.array([True, True, False, False, False])
In [79]: b = a[:, ind]
In [80]: b.indices
Out[80]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
In [81]: a.indices
Out[81]: array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
How come b.indices
is not [0, 1, 2, 3, 0, 1, 2, 3]
?
And since this behaviour is not the one I expect, is a[:, ind]
not the correct way to extract columns from a csc matrix?
The indices are not sorted. You can either force the looping by reversing in a's rows, which is not that intuitive, or enforce sorted indices (you can also do it in-place, but I prefer casting). What I find funny is that the has_sorted_indices attribute does not always return a boolean, but mixes it with integer representation.
a = scipy.sparse.csc_matrix(np.ones([4, 5]))
ind = np.array([True, True, False, False, False])
b = a[::-1, ind]
b2 = a[:, ind]
b3 = b2.sorted_indices()
b.indices
>>array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b.has_sorted_indices
>>1
b2.indices
>>array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
b2.has_sorted_indices
>>0
b3.indices
array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b3.has_sorted_indices
>>True
csc
and csr
indices are not guaranteed to be sorted. I can't off hand find documentation to the effect, but the has_sort_indices
and the sort
methods suggest that.
In your case the order is the result of how the indexing is done. I found in previous SO questions, that multicolumn indexing is performed with a matrix multiplication:
In [165]: a = sparse.csc_matrix(np.ones([4,5]))
In [166]: b = a[:,[0,1]]
In [167]: b.indices
Out[167]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
This indexing is the equivalent to constructing a 'selection' matrix:
In [169]: I = sparse.csr_matrix(np.array([[1,0,0,0,0],[0,1,0,0,0]]).T)
In [171]: I.A
Out[171]:
array([[1, 0],
[0, 1],
[0, 0],
[0, 0],
[0, 0]], dtype=int32)
and doing this matrix multiplication:
In [172]: b1 = a * I
In [173]: b1.indices
Out[173]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
The order is the result of how the matrix multiplication was done. In fact a * a.T
does the same reversal. We'd have to examine the multiplication code to know exactly why. Evidently the csc
and csr
calculation code doesn't require sorted indices
, and doesn't bother to ensure the results are sorted.
https://docs.scipy.org/doc/scipy-0.19.1/reference/sparse.html#further-details
Further Details¶ CSR column indices are not necessarily sorted. Likewise for CSC row indices. Use the .sorted_indices() and .sort_indices() methods when sorted indices are required (e.g. when passing data to other libraries).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With