Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

scipy sparse matrices and cython

I need to perform a set of operations on a scipy sparse matrix in a Cython method.

To efficiently apply these I need access to lil_matrix representation. The lil (linked-list sparse matrix) data representation in python uses list of lists with different lengths.

How can I efficiently pass a list of list of different length to cython (without copying)? Is there any other way to access lil-matrices in cython?

like image 982
SKV Avatar asked Aug 21 '14 16:08

SKV


1 Answers

The example below iterates over a lil_matrix and calculates the sum for each row.

Note I am doing no declarations and even though it is extremely fast because Cython is already optimized for built-in types such as lists. The timings are also shown below...

import time

import numpy as np
cimport numpy as np

from scipy.sparse import lil_matrix

cdef iter_over_lil_matrix(m):
    cdef list sums, data_row
    sums = []
    for data_row in m.data:
        s = 0
        for value in data_row:
            s += value
        sums.append(s)
    return sums

def main():
    a = np.random.random((1e4*1e4))
    a[a>0.1] = 0
    a = a.reshape(1e4,1e4)
    m = lil_matrix(a)

    t0 = time.clock()
    sums = iter_over_lil_matrix(m)
    t1 = time.clock()
    print 'Cython lil_matrix Time', t1-t0

    t0 = time.clock()
    array_sums = a.sum(axis=1)
    t1 = time.clock()
    print 'Numpy ndarray Time', t1-t0

    t0 = time.clock()
    lil_sums = m.sum(axis=1)
    t1 = time.clock()
    print 'lil_matrix Time', t1-t0

    mcsr = m.tocsr()
    t0 = time.clock()
    csr_sums = mcsr.sum(axis=1)
    t1 = time.clock()
    print 'csr_matrix Time', t1-t0

    assert np.allclose(array_sums, sums)
    assert np.allclose(array_sums, np.asarray(lil_sums).flatten())
    assert np.allclose(array_sums, np.asarray(csr_sums).flatten())

Timings in seconds - only about 2 times slower than the super-optimized NumPy :D, much faster than the lil_matrix.sum() method because it converts to csr_matrix() before, as clarified by @hpaulj and confirmed by the results below. Note that the csr_matrix.sum() over the columns is almost one order of magnitude faster than the dense sum.

Cython lil_matrix Time 0.183935034665
Numpy ndarray Time 0.106583238273
lil_matrix Time 2.47158218631
csr_matrix Time 0.0140050888745

Things that will slow down the code:

  • use of for i in range(len(m.data)): with data_row = m.data[i]
  • declare buffers like np.ndarray[object, ndim=1] data with data=m.data

Things that did not affect:

  • boundscheck or wraparound
like image 177
Saullo G. P. Castro Avatar answered Sep 29 '22 00:09

Saullo G. P. Castro