I have a sparse matrix that I arrived at through a complicated bunch of calculations which I cannot reproduce here. I will try to find a simpler example of this.
For now, does anyone know how it might be (even remotely) possible to have a sparse matrix X with the property that:
In [143]: X.sum(0).sum()
Out[143]: 131138
In [144]: X.sum()
Out[144]: 327746
In [145]: X.sum(1).sum()
Out[145]: 327746
In [146]: type(X)
Out[146]: scipy.sparse.csr.csr_matrix
My only guess is that to sum columns correctly I first need to convert the matrix to CSC, which would make sense. Still, one would expect the sparse package to handle column sums gracefully (or raise an error) instead of silently returning a WRONG answer.
After more thought, I tried the following:
In [164]: X.tocsr().sum(0).sum()
Out[164]: 131138
In [165]: X.tocsc().sum(0).sum()
Out[165]: 131138
In [166]: X.tocoo().sum(0).sum()
Out[166]: 131138
In [167]: X.tolil().sum(0).sum()
Out[167]: 131138
In [168]: X.todok().sum(0).sum()
Out[168]: 131138
In [169]: X.shape
Out[169]: (196980, 43)
In [170]: X
Out[170]:
<196980x43 sparse matrix of type '<type 'numpy.uint16'>'
with 70875 stored elements in Compressed Sparse Row format>
In [172]: X.todense().sum(0)
Out[172]:
matrix([[170726, 1041, 117398, 3526, 13202, 3585, 2355, 1895, 1392, 2189, 2070, 2603, 1676, 496, 1194, 933, 129,
529, 544, 256, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint64)
In [173]: X.sum(0)
Out[173]:
matrix([[39654, 1041, 51862, 3526, 13202, 3585, 2355, 1895, 1392, 2189, 2070, 2603, 1676, 496, 1194, 933, 129, 529, 544, 256,
7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0]], dtype=uint16)
I should add some more context: the matrix has only non-negative entries (they are counts). In particular, there were two sparse count matrices A and B which I multiplied together to get X.
Ok, so seberg answered the question. Thanks a lot. Go seberg!
He observed that the data type uint16 might be a problem. Sure enough: uint16 maxes out at 65,535, and my column sums are much bigger than that, even though the individual entries are far smaller.
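As a quick sanity check on the overflow explanation, the wrong column sums should equal the true ones reduced modulo 2**16. The sketch below uses only the numbers already printed in this post (the list name `true_sums` is just for illustration) and plain Python integers, so no SciPy is involved:

```python
# True column sums copied from X.todense().sum(0) above (trailing zeros omitted).
true_sums = [170726, 1041, 117398, 3526, 13202, 3585, 2355, 1895, 1392, 2189,
             2070, 2603, 1676, 496, 1194, 933, 129, 529, 544, 256, 7]

# A uint16 accumulator keeps only the low 16 bits, i.e. the value modulo 2**16.
wrapped = [s % 2**16 for s in true_sums]

print(wrapped[:3])     # [39654, 1041, 51862], the values X.sum(0) reported
print(sum(true_sums))  # 327746, matches X.sum()
print(sum(wrapped))    # 131138, matches X.sum(0).sum()
```

Only the first and third columns exceed 65,535, which is why those are the only two entries that differ between the two outputs.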
Proof in the pudding:
In [184]: Y = sparse.csc_matrix(X,dtype=np.uint32)
In [185]: Y.sum(0).sum()
Out[185]: 327746
In [187]: Y.sum(0)
Out[187]:
matrix([[170726, 1041, 117398, 3526, 13202, 3585, 2355, 1895, 1392, 2189, 2070, 2603, 1676, 496, 1194, 933, 129,
529, 544, 256, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint32)
This explains the inconsistent sums, and changing the data type fixes the issue. There is still a lingering problem, though: if a matrix has only small entries, I want to be able to store it in a smaller data type to save memory.
This is kind of a separate but related question:
Is there a way to gracefully handle numerical overflow problems when summing columns of sparse matrices?
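One possible workaround, offered only as a sketch: keep the matrix stored as uint16 and accumulate the column sums into a wider array by hand. The helper name `column_sums_wide` and the toy matrix `X_small` are hypothetical, not part of SciPy; the code assumes NumPy and SciPy are installed and relies on `tocoo()` and `np.add.at`. It does not address whether newer SciPy versions already upcast inside `sum`.

```python
import numpy as np
from scipy import sparse


def column_sums_wide(X, dtype=np.int64):
    """Sum each column of a sparse matrix into a wider accumulator dtype.

    Hypothetical helper: the stored matrix keeps its small dtype; only the
    nonzero values are widened temporarily while they are added up.
    """
    coo = X.tocoo()                        # exposes (row, col, data) triplets
    out = np.zeros(X.shape[1], dtype=dtype)
    # Unbuffered add, so repeated column indices all contribute to the total.
    np.add.at(out, coo.col, coo.data.astype(dtype))
    return out


# Toy example: three entries of 30000 in column 0 exceed the uint16 range.
X_small = sparse.csr_matrix(
    np.array([[30000, 1], [30000, 0], [30000, 2]], dtype=np.uint16)
)
sums = column_sums_wide(X_small)
print(sums)  # [90000     3]
```

The temporary copy is proportional to the number of stored elements (70875 for my X), not to the full 196980x43 shape, so the memory saving of the small dtype is mostly preserved.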