I want to iteratively build sparse matrices, and noticed that there are two suitable options for this according to the SciPy documentation:
LiL matrix:
class scipy.sparse.lil_matrix(arg1, shape=None, dtype=None, copy=False)[source] Row-based linked list sparse matrix
This is an efficient structure for constructing sparse matrices incrementally.
DoK matrix:
class scipy.sparse.dok_matrix(arg1, shape=None, dtype=None, copy=False)[source] Dictionary Of Keys based sparse matrix.
This is an efficient structure for constructing sparse matrices incrementally.
But when I'm running benchmarks comparing to building a dictionary of dictionary of values (which later can be easily converted to sparse matrix), the latter turns out to be about 10-20 times faster than using any of the sparse matrix models:
from scipy.sparse import dok_matrix, lil_matrix
from timeit import timeit
from collections import defaultdict
def common_dict(rows, cols):
freqs = defaultdict(lambda: defaultdict(int))
for row, col in zip(rows, cols):
freqs[row][col] += 1
return freqs
def dok(rows, cols):
freqs = dok_matrix((1000,1000))
for row, col in zip(rows, cols):
freqs[row,col] += 1
return freqs
def lil(rows, cols):
freqs = lil_matrix((1000,1000))
for row, col in zip(rows, cols):
freqs[row,col] += 1
return freqs
def benchmark():
cols = range(1000)
rows = range(1000)
res = timeit("common_dict({},{})".format(rows, cols),
"from __main__ import common_dict",
number=100)
print("common_dict: {}".format(res))
res = timeit("dok({},{})".format(rows, cols),
"from __main__ import dok",
number=100)
print("dok: {}".format(res))
res = timeit("lil({},{})".format(rows, cols),
"from __main__ import lil",
number=100)
print("lil: {}".format(res))
Results:
benchmark()
common_dict: 0.11778324202168733
dok: 2.2927695910912007
lil: 1.3541790939634666
What is it that causes such a overhead for the matrix models, and is there some way to speed it up? Are there use cases where either dok or lil are to prefer over a common dict of dicts?
Using sparse matrices to store data that contains a large number of zero-valued elements can both save a significant amount of memory and speed up the processing of that data. sparse is an attribute that you can assign to any two-dimensional MATLAB® matrix that is composed of double or logical elements.
A sparse matrix is a matrix that is comprised of mostly zero values. Sparse matrices are distinct from matrices with mostly non-zero values, which are referred to as dense matrices. A matrix is sparse if many of its coefficients are zero.
The compressed sparse row (CSR) or compressed row storage (CRS) or Yale format represents a matrix M by three (one-dimensional) arrays, that respectively contain nonzero values, the extents of rows, and column indices. It is similar to COO, but compresses the row indices, hence the name.
The COO is also known as the transactional format. In this format, the matrix is represented as a set of triples , where x is an entry in the matrix and i and j denote its row and column indices, respectively.
When I change your +=
to just =
for your 2 sparse arrays:
for row, col in zip(rows, cols):
#freqs[row,col] += 1
freqs[row,col] = 1
their respective times are cut in half. What's consuming the most time is the indexing. With +=
it is has to do both a __getitem__
and a __setitem__
.
When the docs say that dok
and lil
are better for iterative construction they mean that it's easier to expand their underlying data structures than for the other formats.
When I try to make a csr
matrix with your code, I get a:
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. SparseEfficiencyWarning)
and 30x slower speed.
So the speed claims are relative to formats like csr
, not relative to pure Python or numpy
structures.
You might want to look at the Python code for dok_matrix.__get_item__
and dok_matrix.__set_item__
to see what happens when you do freq[r,c]
.
A faster way to construct your dok
would be:
freqs = dok_matrix((1000,1000))
d = dict()
for row, col in zip(rows, cols):
d[(row, col)] = 1
freqs.update(d)
taking advantage of the fact that a dok
is a subclassed dictionary. Note that dok
matrix is not a dictionary of dictionaries. Its keys are tuples like (50,50)
.
Another fast way of constructing the same sparse array is:
freqs = sparse.coo_matrix((np.ones(1000,int),(rows,cols)))
In other words, since you already have the rows
and cols
arrays (or ranges), calculate the corresponding data
array, and THEN construct the sparse array.
But if you must perform sparse operations on your matrix between incremental growth steps, then dok
or lil
may be your best choices.
Sparse matricies were developed for linear algebra problems, such as solving a linear equation with a large sparse matrix. I used them years ago in MATLAB to solve finite difference problems. For that work the calculation friendly csr
format is the ultimate goal, and the coo
format was a convenient initialization format.
Now many of the SO scipy sparse questions arise from scikit-learn
and text analysis problems. They are also used in a biological database files. But still the (data),(row,col)
definition method works best.
So sparse matrices were never intended for fast incremental creation. The traditional Python structures like dictionaries and lists are much better for that.
Here's a faster dok
iteration that takes advantage of its dictionary methods. update
seems to work as fast as on a plain dictionary. get
is about 3x faster the equivalent indexing (freq[row,col]
). Indexing probably uses get
, but must have a lot of overhead.
def fast_dok(rows, cols):
freqs = dok_matrix((1000,1000))
for row, col in zip(rows,cols):
i = freqs.get((row,col),0)
freqs.update({(row,col):i+1})
return freqs
Skipping the get
, and just doing
freqs.update({(row,col): 1)
is even faster - faster than the defaultdict of defaultdict example, and nearly as fast as simple dictionary initialization ({(r, c):1 for r,c in zip(rows, cols)}
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With