I am constructing a sparse vector using a scipy.sparse.csr_matrix like so:
csr_matrix((values, (np.zeros(len(indices)), indices)), shape = (1, max_index))
This works fine for most of my data, but occasionally I get a ValueError: could not convert integer scalar. 
This reproduces the problem:
In [145]: inds
Out[145]:
array([ 827969148,  996833913, 1968345558,  898183169, 1811744124,
        2101454109,  133039182,  898183170,  919293479,  133039089])
In [146]: vals
Out[146]:
array([ 1.,  1.,  1.,  1.,  1.,  2.,  1.,  1.,  1.,  1.])
In [147]: max_index
Out[147]:
2337713000
In [143]: csr_matrix((vals, (np.zeros(10), inds)), shape = (1, max_index+1))
...
    996         fn = _sparsetools.csr_sum_duplicates
    997         M,N = self._swap(self.shape)
--> 998         fn(M, N, self.indptr, self.indices, self.data)
    999 
    1000         self.prune()  # nnz may have changed
ValueError: could not convert integer scalar
inds is a np.int64 array and vals is a np.float64 array.
The relevant part of the scipy sum_duplicates code is here.
Note that this works:
In [235]: csr_matrix(([1,1], ([0,0], [1,2])), shape = (1, 2**34))
Out[235]:
<1x17179869184 sparse matrix of type '<type 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
So the problem is not simply that one of the dimensions is > 2**31.
Any thoughts on why these values would cause a problem?
Could it be that max_index > 2**31? Try this, just to make sure (using // so that the halved indices and shape stay integers):
csr_matrix((vals, (np.zeros(10), inds // 2)), shape = (1, max_index // 2))
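As a quick sanity check on that suspicion, the sketch below compares the numbers from the question against the signed 32-bit limit. It uses plain Python integers rather than NumPy or SciPy, and the variable names other than inds and max_index are made up for illustration. It does not prove what sum_duplicates is doing internally; it only shows which quantities fall outside the int32 range.

```python
# Values copied from the question; plain Python ints, so nothing needs importing.
inds = [827969148, 996833913, 1968345558, 898183169, 1811744124,
        2101454109, 133039182, 898183170, 919293479, 133039089]
max_index = 2337713000

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold

# Do the column indices themselves fit in int32?
indices_fit_int32 = max(inds) <= INT32_MAX

# Does the number of columns used in the failing call (max_index + 1) fit?
shape_fits_int32 = (max_index + 1) <= INT32_MAX

# Does the halved shape from the suggested experiment fit?
halved_shape_fits_int32 = (max_index // 2) <= INT32_MAX

print("largest index fits in int32:", indices_fit_int32)
print("column count fits in int32: ", shape_fits_int32)
print("halved column count fits:   ", halved_shape_fits_int32)
```

So every index is below 2**31 while the requested column count is not, and halving brings the column count back into range. If the halved call succeeds, that is consistent with the failure being tied to the shape exceeding the 32-bit range while the indices do not.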