Using a sparse matrix versus numpy array

Tags:

python

matrix

numpy

scipy

scikit-learn

I am creating some numpy arrays with word counts in Python: rows are documents, columns are counts for word X. If I have a lot of zero counts, people suggest using sparse matrices when processing these further, e.g. in a classifier. When feeding a numpy array versus a sparse matrix into the Scikit logistic regression classifier, it did not seem to make much of a difference, however. So I was wondering about three things:

Wikipedia says

a sparse matrix is a matrix in which most of the elements are zero

Is that an appropriate way to determine when to use a sparse matrix format - as soon as > 50 % of the values are zero? Or does it make sense to use just in case?
How much does a sparse matrix help performance in a task like mine, especially compared to a numpy array or a standard list?
So far, I collect my data into a numpy array, then convert into the csr_matrix in Scipy. Is that the right way to do it? I could not figure out how to build a sparse matrix from the ground up, and that might be impossible.

Any help is much appreciated!

886

asked May 01 '16 17:05

patrick

2 Answers

The scipy sparse matrix package, and similar ones in MATLAB, was based on ideas developed from linear algebra problems, such as solving large sparse linear equations (e.g. finite difference and finite element implementations). So things like matrix product (the dot product for numpy arrays) and equation solvers are well developed.

My rough experience is that a sparse csr matrix product is only faster than the equivalent dense dot operation when the matrix is about 1% dense, in other words, one nonzero value for every 99 zeros (but see the tests below).

People also try to use sparse matrices to save memory. Keep in mind that such a matrix has to store 3 arrays (the values, their row indices and their column indices, at least in the coo format), so the fraction of nonzero values has to be less than 1/3 to start saving memory. Obviously you aren't going to save memory if you first build the dense array and then create the sparse one from it.
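As a rough check of the memory argument, here is a minimal sketch that adds up the bytes in the three coo arrays and compares them with the size a dense array of the same shape and dtype would need. The helper names coo_nbytes and dense_nbytes are made up for this illustration. The exact break-even point depends on the value and index dtypes scipy picks (indices are often narrower than the values), so the printed numbers may not land exactly on the 1/3 rule of thumb.

```python
from scipy import sparse


def coo_nbytes(m):
    """Bytes held by the three 1d arrays of a coo matrix (hypothetical helper)."""
    return m.data.nbytes + m.row.nbytes + m.col.nbytes


def dense_nbytes(m):
    """Bytes a dense array of the same shape and dtype would need.

    Computed from the shape, so the dense array is never actually built.
    """
    n_rows, n_cols = m.shape
    return n_rows * n_cols * m.data.dtype.itemsize


for density in (0.5, 0.3, 0.1, 0.01):
    m = sparse.random(1000, 1000, density=density, format="coo")
    print(f"density={density:<5} coo={coo_nbytes(m):>10} bytes  "
          f"dense={dense_nbytes(m):>10} bytes")
```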

The scipy package implements many sparse formats. The coo format is easiest to understand and build. Build one according to documentation and look at its .data, .row, and .col attributes (3 1d arrays).
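To make "build one from the ground up" concrete for the word-count case, here is a small sketch that never creates a dense array. The toy documents, the whitespace tokenizer and the vocab dictionary are assumptions for illustration only; a real pipeline would use its own tokenization.

```python
from collections import Counter

from scipy import sparse

# Toy corpus (made up for this example): one string per document.
docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]

vocab = {}                      # word -> column index, grown as words appear
rows, cols, vals = [], [], []   # the three parallel coo lists

for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        j = vocab.setdefault(word, len(vocab))
        rows.append(i)
        cols.append(j)
        vals.append(count)

# Only the nonzero counts are ever stored.
M = sparse.coo_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)))

print(M.data)   # the nonzero counts
print(M.row)    # document index of each count
print(M.col)    # word index of each count

X = M.tocsr()   # csr is the usual format for math and for passing on to a classifier
```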

csr and csc are typically built from the coo format, and compress the data a bit, making them a bit harder to understand. But they have most of the math functionality.
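The compression is easier to see on a tiny example. This sketch converts a coo matrix to csr and prints the three csr arrays; the values are arbitrary. Instead of one row index per value, csr keeps a row pointer array, indptr, where the entries of row i live in the slice indptr[i]:indptr[i+1].

```python
import numpy as np
from scipy import sparse

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

coo = sparse.coo_matrix((data, (row, col)), shape=(3, 3))
csr = coo.tocsr()

print(csr.indptr)    # [0 2 3 6]       one entry per row, plus one
print(csr.indices)   # [0 2 2 0 1 2]   column index of each stored value
print(csr.data)      # [1 2 3 4 5 6]

# Pull out the stored entries of row 2 by hand.
i = 2
start, end = csr.indptr[i], csr.indptr[i + 1]
print(csr.indices[start:end], csr.data[start:end])   # [0 1 2] [4 5 6]
```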

It is also possible to index csr format, though in general this is slower than the equivalent dense matrix/array case. Other operations like changing values (especially from 0 to nonzero), concatenation, incremental growth, are also slower.
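A small sketch of the "changing values from 0 to nonzero" cost: assigning to an empty slot of a csr matrix works, but scipy flags it with a SparseEfficiencyWarning because the compressed arrays have to be rebuilt. The warning is captured here only so the example can show it; the matrix size and value are arbitrary.

```python
import warnings

from scipy import sparse

S = sparse.csr_matrix((3, 3))   # empty 3x3 matrix, no stored values

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    S[0, 1] = 5.0               # 0 -> nonzero changes the sparsity structure

for w in caught:
    print(w.category.__name__, "-", w.message)

print(S[0, 1])                  # reading a single element back also works
```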

lil (lists of lists) is also easy to understand, and best for incremental building. dok is actually a dictionary subclass.
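Here is a minimal sketch of incremental building with lil and dok, followed by a conversion to csr for the actual math. The shape and the values are arbitrary.

```python
from scipy import sparse

# lil: one Python list of column indices and one list of values per row.
L = sparse.lil_matrix((4, 5))
L[0, 1] = 3
L[2, 4] = 7
L[2, 0] = 1

print(L.rows)   # per-row column indices, kept sorted within each row
print(L.data)   # per-row values, in the same order

# dok: items are set by (row, col) key, much like a dictionary.
D = sparse.dok_matrix((4, 5))
D[1, 2] = 5.0

# Convert once the structure is complete.
C = L.tocsr()
E = D.tocsr()
print(C.nnz, E.nnz)
```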

A key point is that a sparse matrix is limited to 2d, and in many ways behaves like the np.matrix class (though it isn't a subclass).
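A quick sketch of what "behaves like np.matrix" means in practice, using csr_matrix. Note that newer SciPy releases also ship csr_array and friends, which follow ndarray conventions instead; this example is only about the *_matrix classes discussed here.

```python
import numpy as np
from scipy import sparse

A = np.array([[1, 2], [0, 3]])
S = sparse.csr_matrix(A)

print(A * A)                     # ndarray: elementwise      -> [[1 4] [0 9]]
print((S * S).toarray())         # sparse matrix: matrix product -> [[1 8] [0 9]]
print(S.multiply(S).toarray())   # elementwise product on the sparse side

row_sums = S.sum(axis=1)
print(type(row_sums), row_sums.shape)   # a 2d np.matrix of shape (2, 1)

print(S[0].shape)                # indexing a row stays 2d: (1, 2)
```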

A search for other questions using scikit-learn and sparse might be the best way of finding the pros/cons of using these matrices. I've answered a number of questions, but I know the 'sparse' side better than the 'learn' side. I think they are useful, but I get the sense that the fit isn't always the best. Any customization is on the learn side. So far the sparse package has not been optimized for this application.

I just tried some matrix product tests, using the sparse.random method to create a sparse matrix with a specified sparsity. Sparse matrix multiplication performed better than I expected.

    In [251]: M=sparse.random(1000,1000,.5)

    In [252]: timeit M1=M*M
    1 loops, best of 3: 2.78 s per loop

    In [253]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
    1 loops, best of 3: 4.28 s per loop

It is a size issue; for a smaller matrix the dense dot is faster:

    In [255]: M=sparse.random(100,100,.5)

    In [256]: timeit M1=M*M
    100 loops, best of 3: 3.24 ms per loop

    In [257]: timeit Ma=M.toarray(); M2=Ma.dot(Ma)
    1000 loops, best of 3: 1.44 ms per loop

But compare indexing

    In [268]: timeit M.tocsr()[500,500]
    10 loops, best of 3: 86.4 ms per loop

    In [269]: timeit Ma[500,500]
    1000000 loops, best of 3: 318 ns per loop

    In [270]: timeit Ma=M.toarray();Ma[500,500]
    10 loops, best of 3: 23.6 ms per loop

124

answered Oct 11 '22 04:10

hpaulj

@hpaulj Your timeit is misleading: the dense result looks slow because the timed statement includes converting the sparse.random matrix to a numpy array with toarray(), which is slowish. With the conversion moved out of the timing:

    M=sparse.random(1000,1000,.5)
    Ma=M.toarray()

    %timeit -n 25 M1=M*M
    352 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 25 loops each)

    %timeit -n 25 M2=Ma.dot(Ma)
    13.5 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 25 loops each)

To get close to the numpy timing, the density has to drop to about 3%:

    M=sparse.random(1000,1000,.03)

    %timeit -n 25 M1=M*M
    10.7 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 25 loops each)

    %timeit -n 25 M2=Ma.dot(Ma)
    11.4 ms ± 564 µs per loop (mean ± std. dev. of 7 runs, 25 loops each)
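For anyone who wants to rerun this outside IPython, here is a minimal sketch with the standard timeit module. It does the dense conversion once, outside the timed statements, and checks that both products agree. The function name compare, the matrix size and the densities are arbitrary choices for this sketch, and the timings will differ between machines and library versions, so no numbers are quoted.

```python
import timeit

import numpy as np
from scipy import sparse


def compare(n, density, repeats=3):
    """Time sparse vs dense matrix product for an n x n random matrix."""
    M = sparse.random(n, n, density=density, format="csr")
    Ma = M.toarray()   # converted once, not inside the timed statements

    t_sparse = min(timeit.repeat(lambda: M @ M, number=1, repeat=repeats))
    t_dense = min(timeit.repeat(lambda: Ma.dot(Ma), number=1, repeat=repeats))

    # Sanity check: both routes compute the same product.
    assert np.allclose((M @ M).toarray(), Ma.dot(Ma))
    return t_sparse, t_dense


for density in (0.5, 0.03, 0.01):
    t_sparse, t_dense = compare(300, density)
    print(f"density={density:<5} sparse={t_sparse * 1e3:8.2f} ms  "
          f"dense={t_dense * 1e3:8.2f} ms")
```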

answered Oct 11 '22 05:10

komuher
