In scipy, to create a sparse matrix from triple format data (row, col and data arrays), the default behavior is to sum the data values for all duplicates. Can I change this behavior to overwrite (or do nothing) instead?
For example:
import scipy.sparse as sparse
rows = [0, 0]
cols = [0, 0]
data = [1, 1]
S = sparse.coo_matrix((data, (rows, cols)))
Here, S.todense() is equal to matrix([[2]]), but I would like it to be matrix([[1]]).
In the documentation of sparse.coo_matrix, it reads
By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like.
It appears from that formulation that there might be other options than the default.
I've seen discussion on the scipy github about giving more control over this summing, but I don't know of any production changes. As the docs indicate, there's a long-standing tradition of summing the duplicates.
As created, the coo matrix does not sum; it just assigns your parameters to its attributes:
In [697]: S = sparse.coo_matrix((data, (rows, cols)))
In [698]: S.data
Out[698]: array([1, 1])
In [699]: S.row
Out[699]: array([0, 0], dtype=int32)
In [700]: S.col
Out[700]: array([0, 0], dtype=int32)
Converting to dense (or to csr/csc) does sum, but doesn't change S itself:
In [701]: S.A
Out[701]: array([[2]])
In [702]: S.data
Out[702]: array([1, 1])
You can perform the summing in place with:
In [703]: S.sum_duplicates()
In [704]: S.data
Out[704]: array([2], dtype=int32)
I don't know of a way of either removing the duplicates or bypassing that action. I may look up the relevant issue.
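One possible workaround is to drop the duplicates yourself with NumPy before the matrix is built. This is only a minimal sketch, assuming "the last value wins" is the rule you want; coo_last_wins is a hypothetical helper name, not something scipy provides, and it assumes non-empty input arrays.

```python
import numpy as np
import scipy.sparse as sparse

def coo_last_wins(data, rows, cols, shape=None):
    """Hypothetical helper: build a COO matrix that keeps only the last
    value supplied for each duplicated (i, j) position."""
    data = np.asarray(data)
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    if shape is None:
        # Assumption: infer the shape from the largest indices.
        shape = (int(rows.max()) + 1, int(cols.max()) + 1)
    # Collapse each (i, j) pair into a single linear index.
    flat = np.ravel_multi_index((rows, cols), shape)
    # np.unique reports the first occurrence of each value, so searching
    # the reversed array finds the last occurrence in the original order.
    _, first_in_reversed = np.unique(flat[::-1], return_index=True)
    keep = len(flat) - 1 - first_in_reversed
    return sparse.coo_matrix((data[keep], (rows[keep], cols[keep])), shape=shape)

S_last = coo_last_wins([1, 5, 7], [0, 0, 1], [0, 0, 1])
print(S_last.toarray())
```

Because the duplicates are gone before any conversion happens, a later tocsr() or toarray() has nothing left to sum.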
=================
S.todok() does an in-place sum (that is, changes S). Looking at that code, I see that it calls self.sum_duplicates. The following replicates that without the sum:
In [727]: dok=sparse.dok_matrix((S.shape),dtype=S.dtype)
In [728]: dok.update(zip(zip(S.row,S.col),S.data))
In [729]: dok
Out[729]:
<1x1 sparse matrix of type '<class 'numpy.int32'>'
with 1 stored elements in Dictionary Of Keys format>
In [730]: print(dok)
(0, 0) 1
In [731]: S
Out[731]:
<1x1 sparse matrix of type '<class 'numpy.int32'>'
with 2 stored elements in COOrdinate format>
In [732]: dok.A
Out[732]: array([[1]])
It's a dictionary update, so the final value is the last of the duplicates. I found elsewhere that dok.update is a pretty fast way of adding values to a sparse matrix.
tocsr inherently does the sum; tolil uses tocsr; so this todok approach may be simplest.
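As far as I know, some scipy releases refuse a direct dok.update call with NotImplementedError, so here is a sketch of the same "last duplicate wins" idea that relies only on a plain Python dict and item assignment. The variable names are illustrative, and the per-item loop will be slower than a bulk update on large inputs.

```python
import scipy.sparse as sparse

rows = [0, 0]
cols = [0, 0]
data = [1, 2]
S = sparse.coo_matrix((data, (rows, cols)))

# A plain dict keeps only the last value seen for each (i, j) key.
last = {}
for i, j, v in zip(S.row.tolist(), S.col.tolist(), S.data.tolist()):
    last[(i, j)] = v

# Fill a DOK matrix one item at a time instead of calling dok.update.
dok = sparse.dok_matrix(S.shape, dtype=S.dtype)
for (i, j), v in last.items():
    dok[i, j] = v

result = dok.tocsr()  # no duplicates remain, so nothing is summed
print(result.toarray())
```

The original S is left untouched and still holds both stored entries.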