How to create a huge sparse matrix in scipy

I am trying to create a very large sparse matrix with shape (447957347, 5027974). It contains 3,289,288,566 elements.

But when I create a csr_matrix using scipy.sparse, it returns something like this:

<447957346x5027974 sparse matrix of type '<type 'numpy.uint32'>'
    with -1005678730 stored elements in Compressed Sparse Row format>

The source code for creating the matrix is:

import numpy as np
from scipy.sparse import csr_matrix

indptr = np.array(a, dtype=np.uint32)    # a is a Python array('L') containing row index information
indices = np.array(b, dtype=np.uint32)   # b is a Python array('L') containing column index information
data = np.ones((len(indices),), dtype=np.uint32)
test = csr_matrix((data, indices, indptr), shape=(len(indptr) - 1, 5027974), dtype=np.uint32)

I also found that when I convert a Python array of 3 billion elements to a NumPy array, it raises an error:

ValueError: setting an array element with a sequence

But when I create three Python arrays of 1 billion elements each, convert them to NumPy arrays, and then append them, it works fine.

I'm confused.

Ofey asked Apr 30 '14 06:04



1 Answer

You are using an older version of SciPy. In the original implementation of sparse matrices, indices were stored in int32 arrays, even on 64-bit systems. Even if you define them as uint32, as you did, they get cast to int32. So whenever your matrix has more than 2^31 - 1 nonzero entries, as in your case, the indexing overflows and lots of bad things happen. Note that in your case the weird negative number of elements is explained by:

>>> np.int32(np.int64(3289288566))
-1005678730
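
The same wrap-around can be reproduced with explicit two's-complement arithmetic. This is only a minimal sketch, assuming nothing beyond NumPy; the helper name `wrap_to_int32` is made up for illustration and is not part of NumPy or SciPy.

```python
import numpy as np

NNZ = 3289288566  # number of stored elements from the question


def wrap_to_int32(value):
    """Reinterpret an integer as a signed 32-bit value (two's complement).

    Hypothetical helper for illustration; not part of NumPy or SciPy.
    """
    return (value + 2**31) % 2**32 - 2**31


# The count fits in uint32 but not in int32, so the cast to int32 breaks it.
fits_uint32 = NNZ <= np.iinfo(np.uint32).max
fits_int32 = NNZ <= np.iinfo(np.int32).max

# astype performs a C-style cast, which wraps instead of raising an error.
wrapped_numpy = int(np.array([NNZ], dtype=np.int64).astype(np.int32)[0])

print(fits_uint32, fits_int32)
print(wrap_to_int32(NNZ), wrapped_numpy)
```

Both computations give the negative count shown in the question's output, which is why declaring the arrays as uint32 did not help once they were cast to int32.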

The good news is that this has already been figured out. I think this is the relevant PR, although there were some more fixes after that one. In any case, if you use the latest release candidate for SciPy 0.14, your problem should be gone.
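
After upgrading, one quick sanity check is to compare the reported number of stored elements with the length of the index array. The following is a small-scale sketch, assuming SciPy 0.14 or later; the toy arrays stand in for the real data and are not taken from the question.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy CSR structure (illustrative only): 3 rows, 4 columns, 5 stored elements.
indptr = np.array([0, 2, 3, 5], dtype=np.int64)
indices = np.array([0, 3, 1, 0, 2], dtype=np.int64)
data = np.ones(len(indices), dtype=np.uint32)

m = csr_matrix((data, indices, indptr), shape=(len(indptr) - 1, 4))

# nnz should equal the number of column indices and the last indptr entry.
print(m.nnz, len(indices), int(indptr[-1]))

# SciPy chooses the index dtype itself, so inspect it rather than assume it.
print(m.indices.dtype, m.indptr.dtype)
```

If `nnz` comes back negative or differs from `len(indices)`, the index arrays have most likely overflowed, as in the output from the question.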

Jaime answered Sep 25 '22 18:09