Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently set row in SciPy sparse.lil_matrix?

I have sparse vectors with dimensionalities of around 200.000. I also have a matrix with the same amount of columns, and the same amount of rows as the number of vectors. I want to set all of these in an incremental fashion to the matrix, that is, the first vector should get set to the first row, and so on.

Currently, the matrix and vectors are of type scipy.sparse.lil_matrix. The vectors get set to a specific row of the matrix using the following function:

In [7]: us.get_utterance_representation('here is a sentence')
Out[7]:
<1x188796 sparse matrix of type '<type 'numpy.float64'>'
    with 22489 stored elements in Compressed Sparse Row format>

def set_row_vector(self, row, rowvector):
    self.matrix[row] = rowvector[0]

for row, utterance in enumerate(utterances):
    uvector = self.get_utterance_representation(utterance)
    self.utterancematrix.add_row_vector(row, uvector)

Where uvector is a lil_matrix of dimensionality 1x~200.000.

Creating the matrix in this way turns out to be highly inefficient, where one single text string (utterance), takes up to 5 seconds. Looking at the profiling, I have come to the conclusion that it is the setting of the vector as a row in the matrix that is the main problem.

     55     def set_row_vector(self, row, rowvector):
         2564609 function calls (2564606 primitive calls) in 5.046 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    22489    1.397    0.000    1.397    0.000 {numpy.core.multiarray.where}
    22489    0.783    0.000    2.188    0.000 csr.py:281(_get_single_element)
    44978    0.365    0.000    0.916    0.000 stride_tricks.py:35(broadcast_arrays)
    44978    0.258    0.000    0.413    0.000 stride_tricks.py:22(as_strided)
   202490    0.244    0.000    0.244    0.000 {numpy.core.multiarray.array}
    22489    0.199    0.000    2.221    0.000 lil.py:280(__setitem__)
    44978    0.174    0.000    0.399    0.000 sputils.py:171(_unpack_index)
   584777    0.171    0.000    0.171    0.000 {isinstance}
    44988    0.170    0.000    0.230    0.000 sputils.py:115(isintlike)
    67467    0.166    0.000    0.278    0.000 sputils.py:196(_check_boolean)
    22489    0.154    0.000    0.647    0.000 sputils.py:215(_index_to_arrays)
        1    0.129    0.129    5.035    5.035 dsm_classes.py:55(set_row_vector)
    22489    0.120    0.000    0.171    0.000 lil.py:247(_insertat2)
    67467    0.102    0.000    0.102    0.000 {method 'ravel' of 'numpy.ndarray' objects}

My question is, is there a better way to accomplish the creation of the matrix from the utterances?

(Thank you)

like image 365
Jimmy C Avatar asked Jan 30 '14 16:01

Jimmy C


People also ask

Is sparse matrix memory efficient?

Sparse matrices are often stored in compressed sparse row (CSR) format, which stores values and column indices of all elements in two separate arrays where elements of each row are stored continuously in memory. Row starts are stored in a third array which enables efficient access to sparse rows.

What is compressed sparse row format?

The compressed sparse row (CSR) or compressed row storage (CRS) or Yale format represents a matrix M by three (one-dimensional) arrays, that respectively contain nonzero values, the extents of rows, and column indices. It is similar to COO, but compresses the row indices, hence the name.

How do you reduce sparse matrix in python?

The dimensionality of the sparse matrix can be reduced by first representing the dense matrix as a Compressed sparse row representation in which the sparse matrix is represented using three one-dimensional arrays for the non-zero values, the extents of the rows, and the column indexes.

What does SciPy sparse Csr_matrix do?

The function csr_matrix() is used to create a sparse matrix of compressed sparse row format whereas csc_matrix() is used to create a sparse matrix of compressed sparse column format.


1 Answers

First, I think your uvector is actually in CSR format, not LIL. This is probably for the best, however:

In [30]: import scipy.sparse as ss

In [31]: row = ss.rand(1,5000,0.1,'csr')

In [32]: matrix = ss.lil_matrix((30,5000))

In [33]: %timeit matrix[0] = row
10 loops, best of 3: 65.6 ms per loop

In [34]: row_lil = row.tolil()

In [35]: %timeit matrix[0] = row_lil
10 loops, best of 3: 93.4 ms per loop

Next, you can avoid some cost by dropping the [0] subscript on your rowvector:

In [38]: %timeit matrix[0] = row[0]
10 loops, best of 3: 104 ms per loop

In [39]: %timeit matrix[0] = row
10 loops, best of 3: 68.7 ms per loop

Finally, the real meat of the solution is to avoid the LIL format whenever possible. While it's the most flexible format, it's also the slowest (generally). For instance, if you're just trying to build your matrix one row at a time, you can use scipy.sparse.vstack:

In [40]: %%timeit
   ....: for i in xrange(matrix.shape[0]):
   ....:   matrix[i] = row
   ....:
1 loops, best of 3: 3.14 s per loop

In [41]: %timeit ss.vstack([row for i in xrange(matrix.shape[0])])
1000 loops, best of 3: 1.46 ms per loop

In [44]: m2 = ss.vstack([row for i in xrange(matrix.shape[0])])

In [45]: numpy.allclose(matrix.todense(), m2.todense())
Out[45]: True

EDIT: If memory is a concern and you still want maximum speed, you can make your own vstack based on the fast vstack for CSR matrices. I would start by copying the _compressed_sparse_stack function and calling it with your list of CSR rows and axis = 0. Then, you should be able to modify it to take an iterator instead of a list, which would avoid the high memory overhead. Or, you could inline the steps into your for loop. Either way, you'll lose a little speed, but potentially save a lot of memory.

like image 177
perimosocordiae Avatar answered Oct 08 '22 00:10

perimosocordiae