I have sparse vectors with dimensionalities of around 200.000. I also have a matrix with the same amount of columns, and the same amount of rows as the number of vectors. I want to set all of these in an incremental fashion to the matrix, that is, the first vector should get set to the first row, and so on.
Currently, the matrix and vectors are of type scipy.sparse.lil_matrix. The vectors get set to a specific row of the matrix using the following function:
In [7]: us.get_utterance_representation('here is a sentence')
Out[7]:
<1x188796 sparse matrix of type '<type 'numpy.float64'>'
with 22489 stored elements in Compressed Sparse Row format>
def set_row_vector(self, row, rowvector):
self.matrix[row] = rowvector[0]
for row, utterance in enumerate(utterances):
uvector = self.get_utterance_representation(utterance)
self.utterancematrix.add_row_vector(row, uvector)
Where uvector is a lil_matrix of dimensionality 1x~200.000.
Creating the matrix in this way turns out to be highly inefficient, where one single text string (utterance), takes up to 5 seconds. Looking at the profiling, I have come to the conclusion that it is the setting of the vector as a row in the matrix that is the main problem.
55 def set_row_vector(self, row, rowvector):
2564609 function calls (2564606 primitive calls) in 5.046 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
22489 1.397 0.000 1.397 0.000 {numpy.core.multiarray.where}
22489 0.783 0.000 2.188 0.000 csr.py:281(_get_single_element)
44978 0.365 0.000 0.916 0.000 stride_tricks.py:35(broadcast_arrays)
44978 0.258 0.000 0.413 0.000 stride_tricks.py:22(as_strided)
202490 0.244 0.000 0.244 0.000 {numpy.core.multiarray.array}
22489 0.199 0.000 2.221 0.000 lil.py:280(__setitem__)
44978 0.174 0.000 0.399 0.000 sputils.py:171(_unpack_index)
584777 0.171 0.000 0.171 0.000 {isinstance}
44988 0.170 0.000 0.230 0.000 sputils.py:115(isintlike)
67467 0.166 0.000 0.278 0.000 sputils.py:196(_check_boolean)
22489 0.154 0.000 0.647 0.000 sputils.py:215(_index_to_arrays)
1 0.129 0.129 5.035 5.035 dsm_classes.py:55(set_row_vector)
22489 0.120 0.000 0.171 0.000 lil.py:247(_insertat2)
67467 0.102 0.000 0.102 0.000 {method 'ravel' of 'numpy.ndarray' objects}
My question is, is there a better way to accomplish the creation of the matrix from the utterances?
(Thank you)
Sparse matrices are often stored in compressed sparse row (CSR) format, which stores values and column indices of all elements in two separate arrays where elements of each row are stored continuously in memory. Row starts are stored in a third array which enables efficient access to sparse rows.
The compressed sparse row (CSR) or compressed row storage (CRS) or Yale format represents a matrix M by three (one-dimensional) arrays, that respectively contain nonzero values, the extents of rows, and column indices. It is similar to COO, but compresses the row indices, hence the name.
The dimensionality of the sparse matrix can be reduced by first representing the dense matrix as a Compressed sparse row representation in which the sparse matrix is represented using three one-dimensional arrays for the non-zero values, the extents of the rows, and the column indexes.
The function csr_matrix() is used to create a sparse matrix of compressed sparse row format whereas csc_matrix() is used to create a sparse matrix of compressed sparse column format.
First, I think your uvector
is actually in CSR format, not LIL. This is probably for the best, however:
In [30]: import scipy.sparse as ss
In [31]: row = ss.rand(1,5000,0.1,'csr')
In [32]: matrix = ss.lil_matrix((30,5000))
In [33]: %timeit matrix[0] = row
10 loops, best of 3: 65.6 ms per loop
In [34]: row_lil = row.tolil()
In [35]: %timeit matrix[0] = row_lil
10 loops, best of 3: 93.4 ms per loop
Next, you can avoid some cost by dropping the [0]
subscript on your rowvector
:
In [38]: %timeit matrix[0] = row[0]
10 loops, best of 3: 104 ms per loop
In [39]: %timeit matrix[0] = row
10 loops, best of 3: 68.7 ms per loop
Finally, the real meat of the solution is to avoid the LIL format whenever possible. While it's the most flexible format, it's also the slowest (generally). For instance, if you're just trying to build your matrix one row at a time, you can use scipy.sparse.vstack
:
In [40]: %%timeit
....: for i in xrange(matrix.shape[0]):
....: matrix[i] = row
....:
1 loops, best of 3: 3.14 s per loop
In [41]: %timeit ss.vstack([row for i in xrange(matrix.shape[0])])
1000 loops, best of 3: 1.46 ms per loop
In [44]: m2 = ss.vstack([row for i in xrange(matrix.shape[0])])
In [45]: numpy.allclose(matrix.todense(), m2.todense())
Out[45]: True
EDIT: If memory is a concern and you still want maximum speed, you can make your own vstack
based on the fast vstack
for CSR matrices. I would start by copying the _compressed_sparse_stack
function and calling it with your list of CSR rows and axis = 0
. Then, you should be able to modify it to take an iterator instead of a list, which would avoid the high memory overhead. Or, you could inline the steps into your for loop. Either way, you'll lose a little speed, but potentially save a lot of memory.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With