What is row slicing vs What is column slicing?

Yes, I've read this and this answer, but I still can't wrap my head around it... it's a basic question.

In:

M[:, index]
M[index, :]

Which one is row slicing, and which one is column slicing?

And to my problem, if I want to do advanced indexing for the columns like:

M[:, indexes]  # indexes is an array like [0, 4, 9]

Which sparse matrix type is most efficient for doing M[:, indexes]: CSR or CSC?

asked Feb 02 '17 by juanmirocks


2 Answers

Actually neither is row/column slicing: those are examples of row/column indexing instead.

  • M[index, :] is row indexing
  • M[:, index] is column indexing
  • M[start:stop, :] is row slicing
  • M[:, start:stop] is column slicing
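
A minimal sketch of the four cases above, using a small random sparse matrix (the variable names are mine, just for illustration):

from scipy import sparse

M = sparse.rand(5, 5, 0.5, 'csr')

row_indexed = M[2, :]     # row indexing: one row, shape (1, 5)
col_indexed = M[:, 2]     # column indexing: one column, shape (5, 1)
row_sliced  = M[1:4, :]   # row slicing: rows 1..3, shape (3, 5)
col_sliced  = M[:, 1:4]   # column slicing: columns 1..3, shape (5, 3)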

CSC is more efficient at retrieving entire columns: the non-zero values of a specific column and the matching row indices are internally stored as contiguous arrays in memory.

Dually, CSR is more efficient at retrieving an entire row.
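
To make that concrete, here is a toy illustration (my own example, not from the answer) of the CSC layout: the non-zero values of column j and their row indices sit in the contiguous slices data[indptr[j]:indptr[j+1]] and indices[indptr[j]:indptr[j+1]]:

import numpy as np
from scipy import sparse

A = sparse.csc_matrix(np.array([[1, 0, 2],
                                [0, 3, 0],
                                [4, 0, 5]]))

j = 2                                  # column to extract
start, stop = A.indptr[j], A.indptr[j + 1]
print(A.data[start:stop])              # values in column 2: [2 5]
print(A.indices[start:stop])           # their row indices:  [0 2]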

answered by ogrisel


While row selection is faster for csr than column selection, the difference is not big:

In [288]: Mbig=sparse.rand(1000,1000,.1, 'csr')
In [289]: Mbig[:1000:50,:]
Out[289]: 
<20x1000 sparse matrix of type '<class 'numpy.float64'>'
    with 2066 stored elements in Compressed Sparse Row format>
In [290]: timeit Mbig[:1000:50,:]
1000 loops, best of 3: 1.53 ms per loop
In [291]: timeit Mbig[:,:1000:50]
100 loops, best of 3: 2.04 ms per loop

In [292]: Mbig=sparse.rand(1000,1000,.1, 'csc')
In [293]: timeit Mbig[:1000:50,:]
100 loops, best of 3: 2.16 ms per loop
In [294]: timeit Mbig[:,:1000:50]
1000 loops, best of 3: 1.65 ms per loop

It isn't worth switching formats:

In [295]: timeit Mbig.tocsr()[:1000:50,:]
...
100 loops, best of 3: 2.46 ms per loop

Contrast this with the same slice for the dense version:

In [297]: A=Mbig.A
In [298]: timeit A[:,:1000:50]
...
1000000 loops, best of 3: 557 ns per loop
In [301]: timeit A[:,:1000:50].copy()
...
10000 loops, best of 3: 52.5 µs per loop

To complicate the comparison, indexing with an array (numpy 'advanced' indexing) is actually faster than with the slice:

In [308]: idx=np.r_[0:1000:50]    # expand slice into array
In [309]: timeit Mbig[idx,:]
1000 loops, best of 3: 1.49 ms per loop
In [310]: timeit Mbig[:,idx]
1000 loops, best of 3: 513 µs per loop

Here the column indexing of a csc has a greater speed improvement.

And for a single row or column, csr and csc have getrow/getcol methods:

In [314]: timeit Mbig.getrow(500)
1000 loops, best of 3: 434 µs per loop
In [315]: timeit Mbig.getcol(500)        # 1 column from csc is fastest
10000 loops, best of 3: 78.7 µs per loop
In [316]: timeit Mbig[500,:]
1000 loops, best of 3: 505 µs per loop
In [317]: timeit Mbig[:,500]
1000 loops, best of 3: 264 µs per loop

In https://stackoverflow.com/a/39500986/901925 I recreated the extractor code that sparse uses to fetch rows or columns. It constructs a new sparse 'vector' of 1s and 0s, and uses matrix multiplication to 'select' rows or columns.
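
A hedged sketch of that idea (not the exact code from the linked answer): build a sparse selector matrix with a 1 for each wanted column, and multiply:

import numpy as np
from scipy import sparse

M = sparse.rand(1000, 1000, 0.1, 'csc')
idx = np.array([0, 4, 9])

# selector[idx[k], k] = 1, shape (1000, 3)
selector = sparse.csc_matrix(
    (np.ones(len(idx)), (idx, np.arange(len(idx)))),
    shape=(M.shape[1], len(idx)))

cols = M * selector               # same columns as M[:, idx]
print((cols != M[:, idx]).nnz)    # 0 -> identical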

answered by hpaulj