Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why are lil_matrix and dok_matrix so slow compared to common dict of dicts?

I want to iteratively build sparse matrices, and noticed that there are two suitable options for this according to the SciPy documentation:

LiL matrix:

class scipy.sparse.lil_matrix(arg1, shape=None, dtype=None, copy=False)[source] Row-based linked list sparse matrix

This is an efficient structure for constructing sparse matrices incrementally.

DoK matrix:

class scipy.sparse.dok_matrix(arg1, shape=None, dtype=None, copy=False)[source] Dictionary Of Keys based sparse matrix.

This is an efficient structure for constructing sparse matrices incrementally.

But when I'm running benchmarks comparing to building a dictionary of dictionary of values (which later can be easily converted to sparse matrix), the latter turns out to be about 10-20 times faster than using any of the sparse matrix models:

from scipy.sparse import dok_matrix, lil_matrix
from timeit import timeit
from collections import defaultdict

def common_dict(rows, cols):
    freqs = defaultdict(lambda: defaultdict(int))
    for row, col in zip(rows, cols):
        freqs[row][col] += 1

    return freqs

def dok(rows, cols):
    freqs = dok_matrix((1000,1000))
    for row, col in zip(rows, cols):
        freqs[row,col] += 1

    return freqs

def lil(rows, cols):
    freqs = lil_matrix((1000,1000))
    for row, col in zip(rows, cols):
        freqs[row,col] += 1

    return freqs


def benchmark():
    cols = range(1000)
    rows = range(1000)

    res = timeit("common_dict({},{})".format(rows, cols), 
                 "from __main__ import common_dict", 
                 number=100)

    print("common_dict: {}".format(res))

    res = timeit("dok({},{})".format(rows, cols), 
                 "from __main__ import dok", 
                 number=100)

    print("dok: {}".format(res))

    res = timeit("lil({},{})".format(rows, cols), 
                 "from __main__ import lil", 
                 number=100)

    print("lil: {}".format(res))

Results:

benchmark()

common_dict: 0.11778324202168733
dok: 2.2927695910912007
lil: 1.3541790939634666

What is it that causes such a overhead for the matrix models, and is there some way to speed it up? Are there use cases where either dok or lil are to prefer over a common dict of dicts?

like image 843
Jimmy C Avatar asked Jan 04 '15 22:01

Jimmy C


People also ask

What is the importance of sparse matrix?

Using sparse matrices to store data that contains a large number of zero-valued elements can both save a significant amount of memory and speed up the processing of that data. sparse is an attribute that you can assign to any two-dimensional MATLAB® matrix that is composed of double or logical elements.

What does a sparse matrix mean?

A sparse matrix is a matrix that is comprised of mostly zero values. Sparse matrices are distinct from matrices with mostly non-zero values, which are referred to as dense matrices. A matrix is sparse if many of its coefficients are zero.

What is CSR format matrix?

The compressed sparse row (CSR) or compressed row storage (CRS) or Yale format represents a matrix M by three (one-dimensional) arrays, that respectively contain nonzero values, the extents of rows, and column indices. It is similar to COO, but compresses the row indices, hence the name.

What is COO format?

The COO is also known as the transactional format. In this format, the matrix is represented as a set of triples , where x is an entry in the matrix and i and j denote its row and column indices, respectively.


1 Answers

When I change your += to just = for your 2 sparse arrays:

for row, col in zip(rows, cols):
    #freqs[row,col] += 1
    freqs[row,col] = 1

their respective times are cut in half. What's consuming the most time is the indexing. With += it is has to do both a __getitem__ and a __setitem__.

When the docs say that dok and lil are better for iterative construction they mean that it's easier to expand their underlying data structures than for the other formats.

When I try to make a csr matrix with your code, I get a:

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. SparseEfficiencyWarning)

and 30x slower speed.

So the speed claims are relative to formats like csr, not relative to pure Python or numpy structures.

You might want to look at the Python code for dok_matrix.__get_item__ and dok_matrix.__set_item__ to see what happens when you do freq[r,c].


A faster way to construct your dok would be:

freqs = dok_matrix((1000,1000))
d = dict()
for row, col in zip(rows, cols):
    d[(row, col)] = 1
freqs.update(d)

taking advantage of the fact that a dok is a subclassed dictionary. Note that dok matrix is not a dictionary of dictionaries. Its keys are tuples like (50,50).

Another fast way of constructing the same sparse array is:

freqs = sparse.coo_matrix((np.ones(1000,int),(rows,cols)))

In other words, since you already have the rows and cols arrays (or ranges), calculate the corresponding data array, and THEN construct the sparse array.

But if you must perform sparse operations on your matrix between incremental growth steps, then dok or lil may be your best choices.


Sparse matricies were developed for linear algebra problems, such as solving a linear equation with a large sparse matrix. I used them years ago in MATLAB to solve finite difference problems. For that work the calculation friendly csr format is the ultimate goal, and the coo format was a convenient initialization format.

Now many of the SO scipy sparse questions arise from scikit-learn and text analysis problems. They are also used in a biological database files. But still the (data),(row,col) definition method works best.

So sparse matrices were never intended for fast incremental creation. The traditional Python structures like dictionaries and lists are much better for that.


Here's a faster dok iteration that takes advantage of its dictionary methods. update seems to work as fast as on a plain dictionary. get is about 3x faster the equivalent indexing (freq[row,col]). Indexing probably uses get, but must have a lot of overhead.

def fast_dok(rows, cols):
    freqs = dok_matrix((1000,1000))
    for row, col in zip(rows,cols):
         i = freqs.get((row,col),0)
         freqs.update({(row,col):i+1})
    return freqs

Skipping the get, and just doing

         freqs.update({(row,col): 1)

is even faster - faster than the defaultdict of defaultdict example, and nearly as fast as simple dictionary initialization ({(r, c):1 for r,c in zip(rows, cols)})

like image 147
hpaulj Avatar answered Oct 21 '22 08:10

hpaulj