I'm working with a very large sparse matrix multiplication (matmul) problem. As an example let's say:
A is a binary (75 x 200,000) matrix. It's sparse, so I'm using csc for storage. I need to do the following matmul operation:
B = A.transpose() * A
The output is going to be a sparse and symmetric matrix of size 200Kx200K.
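Here's a minimal sketch of what I mean, using scipy's sparse module (the density value is just an arbitrary placeholder for my real data):
import scipy.sparse as sp
# Stand-in for my real data: a random binary 75 x 200,000 CSC matrix.
# The density here is an arbitrary placeholder.
A = sp.random(75, 200_000, density=1e-4, format='csc')
A.data[:] = 1  # make it binary
B = (A.T * A).tocsc()  # sparse, symmetric, 200,000 x 200,000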
Unfortunately, B is going to be way too large to store in RAM (or "in core") on my laptop. On the other hand, I'm lucky, because B has some properties that should solve this problem.
Since B is going to be symmetric about the diagonal and sparse, I could store the results of the matmul in a triangular matrix (upper/lower), and a sparse matrix storage format could further reduce the size.
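Continuing the sketch above, scipy.sparse.triu can keep just the upper triangle of a symmetric sparse result (assuming B could be formed at all):
# Keep only the upper triangle (including the diagonal); the diagonal
# plus one half of the off-diagonal entries is enough to rebuild B.
B_upper = sp.triu(B, k=0, format='csc')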
My question is: can numpy or scipy be told, ahead of time, what the output storage requirements are going to look like, so that I can select a storage solution and avoid the "matrix is too big" runtime error after several minutes (or hours) of calculation?
In other words, can storage requirements for the matrix multiply be approximated by analyzing the contents of the two input matrices using an approximate counting algorithm?
If not, I'm looking into a brute-force solution: something involving map/reduce, out-of-core storage, or a matmul subdivision solution (Strassen's algorithm) from the following web links:
A couple of map/reduce problem subdivision solutions
An out-of-core (PyTables) storage solution
A matmul subdivision solution
Thanks in advance for any recommendations, comments, or guidance!
Since you are after the product of a matrix with its transpose, the value at [m, n] is basically going to be the dot product of columns m and n in your original matrix.
I am going to use the following matrix as a toy example:
a = np.array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])
>>> np.dot(a.T, a)
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 2]])
It is of shape (3, 12) and has 7 non-zero entries. The product of its transpose with it is of course of shape (12, 12), and has 16 non-zero entries, 6 of them in the diagonal, so it only requires storage of 11 elements.
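If you want to check those counts yourself, something like this should do it (using the toy a from above):
>>> np.count_nonzero(a)
7
>>> b = np.dot(a.T, a)
>>> np.count_nonzero(b)
16
>>> np.count_nonzero(b.diagonal())
6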
You can get a good idea of what the size of your output matrix is going to be in one of two ways:
CSR FORMAT
If your original matrix has C non-zero columns, your new matrix will have at most C**2 non-zero entries, of which C are in the diagonal and are assured not to be zero; of the remaining entries you only need to keep half, so that is at most (C**2 + C) / 2 non-zero elements. Of course, many of these will also be zero, so this is probably a gross overestimate.
If your matrix is stored in csr format, then the indices attribute of the corresponding scipy object has an array with the column indices of all non-zero elements, so you can easily compute the above estimate as:
>>> a_csr = scipy.sparse.csr_matrix(a)
>>> a_csr.indices
array([ 2, 11, 1, 7, 10, 4, 11])
>>> np.unique(a_csr.indices).shape[0]
6
So there are 6 columns with non-zero entries, and the estimate is therefore at most 36 non-zero entries, way more than the real 16.
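Wrapped up as a function, the bound might look something like this (a sketch; csr_estimate is a name I made up, not part of scipy):
def csr_estimate(a_csr):
    # Hypothetical helper: number of distinct non-zero columns.
    C = np.unique(a_csr.indices).shape[0]
    # At most C**2 non-zeros in the full product, or (C**2 + C) / 2
    # if only one triangle of the symmetric result is stored.
    return C**2, (C**2 + C) // 2
>>> csr_estimate(a_csr)
(36, 21)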
CSC FORMAT
If, instead of the column indices of non-zero elements, we have their row indices, we can actually do a better estimate. For the dot product of two columns to be non-zero, they must have a non-zero element in the same row. If there are R non-zero elements in a given row, they will contribute R**2 non-zero elements to the product. When you sum this over all rows, you are bound to count some elements more than once, so this is also an upper bound.
The row indices of the non-zero elements of your matrix are in the indices attribute of a sparse csc matrix, so this estimate can be computed as follows:
>>> a_csc = scipy.sparse.csc_matrix(a)
>>> a_csc.indices
array([1, 0, 2, 1, 1, 0, 2])
>>> rows, where = np.unique(a_csc.indices, return_inverse=True)
>>> where = np.bincount(where)
>>> rows
array([0, 1, 2])
>>> where
array([2, 3, 2])
>>> np.sum(where**2)
17
This is darn close to the real 16! And it is no coincidence that this estimate is the same as:
>>> np.sum(np.dot(a.T,a),axis=None)
17
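The reason is that, for a binary matrix, entry [m, n] of the product counts the rows in which columns m and n are both 1, so summing over all entries counts every pair of non-zeros sharing a row, which is exactly the sum of the squared per-row counts. You can check that identity directly:
>>> np.sum(np.sum(a, axis=1)**2)
17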
In any case, the following code should allow you to see that the estimation is pretty good:
import numpy as np
import scipy.sparse

def estimate(a):
    # Sum of squared per-row non-zero counts: an upper bound on the
    # number of non-zeros in a.T * a.
    a_csc = scipy.sparse.csc_matrix(a)
    _, where = np.unique(a_csc.indices, return_inverse=True)
    where = np.bincount(where)
    return np.sum(where**2)

def test(shape=(10, 1000), count=100):
    # Scatter `count` ones into a flat array, then reshape to `shape`.
    a = np.zeros(np.prod(shape), dtype=int)
    a[np.random.randint(np.prod(shape), size=count)] = 1
    print('a non-zero = {0}'.format(np.sum(a)))
    a = a.reshape(shape)
    print('a.T * a non-zero = {0}'.format(np.flatnonzero(np.dot(a.T, a)).shape[0]))
    print('csc estimate = {0}'.format(estimate(a)))
>>> test(count=100)
a non-zero = 100
a.T * a non-zero = 1065
csc estimate = 1072
>>> test(count=200)
a non-zero = 199
a.T * a non-zero = 4056
csc estimate = 4079
>>> test(count=50)
a non-zero = 50
a.T * a non-zero = 293
csc estimate = 294
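As a rough sketch of how you might use this on a matrix of the size in the question before committing to the full multiply (the density and the 32-bit index size are placeholder assumptions):
A = scipy.sparse.random(75, 200_000, density=1e-4, format='csc')
A.data[:] = 1  # binary, as in the question
nnz_bound = estimate(A)
# Rough CSC storage for B: 8 bytes per float64 value plus 4 bytes per
# row index (assuming 32-bit indices), plus the column pointer array.
bytes_bound = nnz_bound * (8 + 4) + (A.shape[1] + 1) * 4
print('at most {0} non-zeros, roughly {1:.1f} MB'.format(nnz_bound, bytes_bound / 1e6))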