Efficiently set row in SciPy sparse.lil_matrix?

Tags:

I have sparse vectors with dimensionalities of around 200.000. I also have a matrix with the same amount of columns, and the same amount of rows as the number of vectors. I want to set all of these in an incremental fashion to the matrix, that is, the first vector should get set to the first row, and so on.

Currently, the matrix and vectors are of type scipy.sparse.lil_matrix. The vectors get set to a specific row of the matrix using the following function:

Click to copy

In [7]: us.get_utterance_representation('here is a sentence')
Out[7]:
<1x188796 sparse matrix of type '<type 'numpy.float64'>'
    with 22489 stored elements in Compressed Sparse Row format>

def set_row_vector(self, row, rowvector):
    self.matrix[row] = rowvector[0]

for row, utterance in enumerate(utterances):
    uvector = self.get_utterance_representation(utterance)
    self.utterancematrix.add_row_vector(row, uvector)

Where uvector is a lil_matrix of dimensionality 1x~200.000.

Creating the matrix in this way turns out to be highly inefficient, where one single text string (utterance), takes up to 5 seconds. Looking at the profiling, I have come to the conclusion that it is the setting of the vector as a row in the matrix that is the main problem.

Click to copy

     55     def set_row_vector(self, row, rowvector):
         2564609 function calls (2564606 primitive calls) in 5.046 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    22489    1.397    0.000    1.397    0.000 {numpy.core.multiarray.where}
    22489    0.783    0.000    2.188    0.000 csr.py:281(_get_single_element)
    44978    0.365    0.000    0.916    0.000 stride_tricks.py:35(broadcast_arrays)
    44978    0.258    0.000    0.413    0.000 stride_tricks.py:22(as_strided)
   202490    0.244    0.000    0.244    0.000 {numpy.core.multiarray.array}
    22489    0.199    0.000    2.221    0.000 lil.py:280(__setitem__)
    44978    0.174    0.000    0.399    0.000 sputils.py:171(_unpack_index)
   584777    0.171    0.000    0.171    0.000 {isinstance}
    44988    0.170    0.000    0.230    0.000 sputils.py:115(isintlike)
    67467    0.166    0.000    0.278    0.000 sputils.py:196(_check_boolean)
    22489    0.154    0.000    0.647    0.000 sputils.py:215(_index_to_arrays)
        1    0.129    0.129    5.035    5.035 dsm_classes.py:55(set_row_vector)
    22489    0.120    0.000    0.171    0.000 lil.py:247(_insertat2)
    67467    0.102    0.000    0.102    0.000 {method 'ravel' of 'numpy.ndarray' objects}

My question is, is there a better way to accomplish the creation of the matrix from the utterances?

(Thank you)

365

asked Jan 30 '14 16:01

Jimmy C

1 Answers

First, I think your uvector is actually in CSR format, not LIL. This is probably for the best, however:

Click to copy

In [30]: import scipy.sparse as ss

In [31]: row = ss.rand(1,5000,0.1,'csr')

In [32]: matrix = ss.lil_matrix((30,5000))

In [33]: %timeit matrix[0] = row
10 loops, best of 3: 65.6 ms per loop

In [34]: row_lil = row.tolil()

In [35]: %timeit matrix[0] = row_lil
10 loops, best of 3: 93.4 ms per loop

Next, you can avoid some cost by dropping the [0] subscript on your rowvector:

Click to copy

In [38]: %timeit matrix[0] = row[0]
10 loops, best of 3: 104 ms per loop

In [39]: %timeit matrix[0] = row
10 loops, best of 3: 68.7 ms per loop

Finally, the real meat of the solution is to avoid the LIL format whenever possible. While it's the most flexible format, it's also the slowest (generally). For instance, if you're just trying to build your matrix one row at a time, you can use scipy.sparse.vstack:

Click to copy

In [40]: %%timeit
   ....: for i in xrange(matrix.shape[0]):
   ....:   matrix[i] = row
   ....:
1 loops, best of 3: 3.14 s per loop

In [41]: %timeit ss.vstack([row for i in xrange(matrix.shape[0])])
1000 loops, best of 3: 1.46 ms per loop

In [44]: m2 = ss.vstack([row for i in xrange(matrix.shape[0])])

In [45]: numpy.allclose(matrix.todense(), m2.todense())
Out[45]: True

EDIT: If memory is a concern and you still want maximum speed, you can make your own vstack based on the fast vstack for CSR matrices. I would start by copying the _compressed_sparse_stack function and calling it with your list of CSR rows and axis = 0. Then, you should be able to modify it to take an iterator instead of a list, which would avoid the high memory overhead. Or, you could inline the steps into your for loop. Either way, you'll lose a little speed, but potentially save a lot of memory.

177

answered Oct 08 '22 00:10

perimosocordiae

Related questions
                            
                                correct me for using generators or tell me other way
                            
                                Error in importing ctypes.wintypes
                            
                                How to read cookie in python
                            
                                BeautifulSoup soup.prettify() gives strange output
                            
                                JWTAssertionCredentials with service account fails with asn1 not enough data error
                            
                                django - How to insert the content of uploaded file in template?
                            
                                Django 1.6 - templates for password change/reset
                            
                                Line break or "\n" is not working.
                            
                                Passing values from one Python program to another
                            
                                OpenCV RGB single channel color regulation
                            
                                Why does Pillow not recognize the JPEG format?
                            
                                What is the equivalent of psycopg curs.mogrify on mysql?
                            
                                Error in SQLAlchemy with Integer: "object() takes no parameters"
                            
                                functools.partial and generators
                            
                                While generating all possible combinations itertools.combinations_with_replacement() vs itertools.product()?
                            
                                Appending to a DataFrame converts dtypes
                            
                                How to find set of most frequently occurring word-pairs in a file using python?
                            
                                Is there a bug in binning in matplotlib histograms? Or non-randomness of the rvs method in scipy.stats
                            
                                Change color implicit plot
                            
                                Python List in a For Loop

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Efficiently set row in SciPy sparse.lil_matrix?

Tags:

python

numpy

scipy

Jimmy C

People also ask

1 Answers

perimosocordiae

Recent Activity

Donate For Us