Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the top n values in a row of a scipy sparse matrix

I have a scipy sparse matrix in CSR format. It's 72665x72665 so it's impractical to convert this matrix to a dense matrix to perform operations on (the dense representation of this matrix is like 40 gigs). The matrix is symmetric, and has about 82 million non-zero entries (~1.5%).

What I would like to be able to do is, for each row, I want to get the indices of the largest N values. If this were a numpy array, I would use np.argpartition to do it like so:

    for row in matrix:
        top_n_idx = np.argpartition(row,-n)[-n:]

Is there something similar to this I can do for a sparse matrix?

like image 956
enumaris Avatar asked Oct 16 '25 04:10

enumaris


1 Answers

Improve from @Paul Panzer's solution. Now it can handle the case when any row has less than n values.

def top_n_idx_sparse(matrix, n):
    """Return index of top n values in each row of a sparse matrix."""
    top_n_idx = []
    for le, ri in zip(matrix.indptr[:-1], matrix.indptr[1:]):
        n_row_pick = min(n, ri - le)
        top_n_idx.append(
            matrix.indices[
                le + np.argpartition(matrix.data[le:ri], -n_row_pick)[-n_row_pick:]
            ]
        )
    return top_n_idx

What it does?

The matrix.indptr gives the indices of the beginning of each row stored in the data array. So (lr, ri) is the range of data indices of non-zero values in each row. matrix.data[le:ri] gives the non-zero values of the row. Passing that to np.argpartition(..., -n_row_pick) gives you the local indices that will sort the row for the largest n_row_pick elements from the back. [-n_row_pick:] select those local indices. Then le + shift those local indices back to the indices in the data array. Finally, pass that to matrix.indices to get the largest n values indices in the matrix space.

like image 94
Louis Yang Avatar answered Oct 17 '25 19:10

Louis Yang