I am creating some numpy arrays with word counts in Python: rows are documents, columns are counts for word X. If I have a lot of zero counts, people suggest using sparse matrices when processing these further, e.g. in a classifier. When feeding a numpy array versus a sparse matrix into the Scikit logistic regression classifier, it did not seem to make much of a difference, however. So I was wondering about three things:
Wikipedia says
a sparse matrix is a matrix in which most of the elements are zero
Is that an appropriate way to determine when to use a sparse matrix format - as soon as > 50 % of the values are zero? Or does it make sense to use just in case?
Any help is much appreciated!
Numpy matrices are strictly 2-dimensional, while numpy arrays (ndarrays) are N-dimensional. Matrix objects are a subclass of ndarray, so they inherit all the attributes and methods of ndarrays.
Sparse representations are useful because they do not store every element. If one particular value, typically this is zero, appears many times in the sparse array, it can be much more efficient if only elements that are different from this common value are stored.
Sparse Matrices in PythonA dense matrix stored in a NumPy array can be converted into a sparse matrix using the CSR representation by calling the csr_matrix() function.
Using sparse matrices to store data that contains a large number of zero-valued elements can both save a significant amount of memory and speed up the processing of that data. sparse is an attribute that you can assign to any two-dimensional MATLAB® matrix that is composed of double or logical elements.
The scipy
sparse matrix package, and similar ones in MATLAB, was based on ideas developed from linear algebra problems, such as solving large sparse linear equations (e.g. finite difference and finite element implementations). So things like matrix product (the dot
product for numpy arrays) and equation solvers are well developed.
My rough experience is that a sparse csr
matrix product has to have a 1% sparsity to be faster than the equivalent dense dot
operation - in other words, one nonzero value for every 99 zeros. (but see tests below)
But people also try to use sparse matrices to save memory. But keep in mind that such a matrix has to store 3 arrays of values (at least in the coo
format). So the sparsity has to be less than 1/3 to start saving memory. Obviously you aren't going to save memory if you first build the dense array, and create the sparse one from that.
The scipy
package implements many sparse formats. The coo
format is easiest to understand and build. Build one according to documentation and look at its .data
, .row
, and .col
attributes (3 1d arrays).
csr
and csc
are typically built from the coo
format, and compress the data a bit, making them a bit harder to understand. But they have most of the math functionality.
It is also possible to index csr
format, though in general this is slower than the equivalent dense matrix/array case. Other operations like changing values (especially from 0 to nonzero), concatenation, incremental growth, are also slower.
lil
(lists of lists) is also easy to understand, and best for incremental building. dok
is a actually a dictionary subclass.
A key point is that a sparse matrix is limited to 2d, and in many ways behaves like the np.matrix
class (though it isn't a subclass).
A search for other questions using scikit-learn
and sparse
might be the best way of finding the pros/cons of using these matrices. I've answered a number of questions, but I know the 'sparse' side better than the 'learn' side. I think they are useful, but I get the sense is that the fit isn't always the best. Any customization is on the learn
side. So far the sparse
package has not been optimized for this application.
I just tried some matrix product tests, using the sparse.random
method to create a sparse matrix with a specified sparsity. Sparse matrix multiplication performed better than I expected.
In [251]: M=sparse.random(1000,1000,.5) In [252]: timeit M1=M*M 1 loops, best of 3: 2.78 s per loop In [253]: timeit Ma=M.toarray(); M2=Ma.dot(Ma) 1 loops, best of 3: 4.28 s per loop
It is a size issue; for smaller matrix the dense dot
is faster
In [255]: M=sparse.random(100,100,.5) In [256]: timeit M1=M*M 100 loops, best of 3: 3.24 ms per loop In [257]: timeit Ma=M.toarray(); M2=Ma.dot(Ma) 1000 loops, best of 3: 1.44 ms per loop
But compare indexing
In [268]: timeit M.tocsr()[500,500] 10 loops, best of 3: 86.4 ms per loop In [269]: timeit Ma[500,500] 1000000 loops, best of 3: 318 ns per loop In [270]: timeit Ma=M.toarray();Ma[500,500] 10 loops, best of 3: 23.6 ms per loop
@hpaulj Your timeit is wrong, u are getting slow results cause of mapping sparse.random to numpy array (its slowish) with that in mind:
M=sparse.random(1000,1000,.5) Ma=M.toarray() %timeit -n 25 M1=M*M 352 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 25 loops each) %timeit -n 25 M2=Ma.dot(Ma) 13.5 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 25 loops each)
To get close to numpy we need to have
M=sparse.random(1000,1000,.03) %timeit -n 25 M1=M*M 10.7 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 25 loops each) %timeit -n 25 M2=Ma.dot(Ma) 11.4 ms ± 564 µs per loop (mean ± std. dev. of 7 runs, 25 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With