I've tried to initialize csc_matrix
and csr_matrix
from a list of (data, (rows, cols))
values as the documentation suggests.
sparse = csc_matrix((data, (rows, cols)), shape=(n, n))
The problem is that, the method that I actually have for generating the data
, rows
and cols
vectors introduces duplicates for some points. By default, scipy adds the values of the duplicate entries. However, in my case, those duplicates have exactly the same value in data
for a given (row, col)
.
What I'm trying to achieve is to make scipy ignore the second entry if already exists one, instead of adding them.
Ignoring the fact that I could improve the generation algorithm to avoid generating duplicates, is there a parameter or another way of creating a sparse matrix that ignores duplicates?
Currently two entries with data = [4, 4]; cols = [1, 1]; rows = [1, 1];
generate a sparse matrix which value at (1,1)
is 8
while the desired value is 4
.
>>> c = csc_matrix(([4, 4], ([1,1],[1,1])), shape=(3,3))
>>> c.todense()
matrix([[0, 0, 0],
[0, 8, 0],
[0, 0, 0]])
I'm also aware that I could filter them by using a 2-dimensional numpy unique
function, but lists are quite large so this is not really a valid option.
Other possible answer to the question: Is there any way of specifying what to do with duplicates? i.e. keeping the min
or max
instead of the default sum
?
Creating an intermediary dok
matrix works in your example:
In [410]: c=sparse.coo_matrix((data, (cols, rows)),shape=(3,3)).todok().tocsc()
In [411]: c.A
Out[411]:
array([[0, 0, 0],
[0, 4, 0],
[0, 0, 0]], dtype=int32)
A coo
matrix puts your input arrays into its data
,col
,row
attributes without change. The summing doesn't occur until it is converted to a csc
.
todok
loads the dictionary directly from the coo
attributes. It creates the blank dok
matrix, and fills it with:
dok.update(izip(izip(self.row,self.col),self.data))
So if there are duplicate (row,col)
values, it's the last one that remains. This uses the standard Python dictionary hashing to find the unique keys.
Here's a way of using np.unique
. I had to construct a special object array, because unique
operates on 1d, and we have a 2d indexing.
In [479]: data, cols, rows = [np.array(j) for j in [[1,4,2,4,1],[0,1,1,1,2],[0,1,2,1,1]]]
In [480]: x=np.zeros(cols.shape,dtype=object)
In [481]: x[:]=list(zip(rows,cols))
In [482]: x
Out[482]: array([(0, 0), (1, 1), (2, 1), (1, 1), (1, 2)], dtype=object)
In [483]: i=np.unique(x,return_index=True)[1]
In [484]: i
Out[484]: array([0, 1, 4, 2], dtype=int32)
In [485]: c1=sparse.csc_matrix((data[i],(cols[i],rows[i])),shape=(3,3))
In [486]: c1.A
Out[486]:
array([[1, 0, 0],
[0, 4, 2],
[0, 1, 0]], dtype=int32)
I have no idea which approach is faster.
An alternative way of getting the unique index, as per liuengo's
link:
rc = np.vstack([rows,cols]).T.copy()
dt = rc.dtype.descr * 2
i = np.unique(rc.view(dt), return_index=True)[1]
rc
has to own its own data in order to change the dtype with view, hence the .T.copy()
.
In [554]: rc.view(dt)
Out[554]:
array([[(0, 0)],
[(1, 1)],
[(2, 1)],
[(1, 1)],
[(1, 2)]],
dtype=[('f0', '<i4'), ('f1', '<i4')])
Since the values in your data
at repeating (row, col) are the same, you can get the unique rows, columns and values as follows:
rows, cols, data = zip(*set(zip(rows, cols, data)))
Example:
data = [4, 3, 4]
cols = [1, 2, 1]
rows = [1, 3, 1]
csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()
matrix([[0, 0, 0, 0],
[0, 8, 0, 0],
[0, 0, 0, 0],
[0, 0, 3, 0]])
rows, cols, data = zip(*set(zip(rows, cols, data)))
csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()
matrix([[0, 0, 0, 0],
[0, 4, 0, 0],
[0, 0, 0, 0],
[0, 0, 3, 0]])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With