
Ignoring duplicate entries in sparse matrix

I've tried to initialize csc_matrix and csr_matrix from a list of (data, (rows, cols)) values as the documentation suggests.

sparse = csc_matrix((data, (rows, cols)), shape=(n, n))

The problem is that the method I actually have for generating the data, rows and cols vectors introduces duplicates for some points. By default, scipy adds the values of the duplicate entries. However, in my case those duplicates have exactly the same value in data for a given (row, col).

What I'm trying to achieve is to make scipy ignore the second entry if one already exists, instead of adding them.

Ignoring the fact that I could improve the generation algorithm to avoid generating duplicates, is there a parameter or another way of creating a sparse matrix that ignores duplicates?

Currently, two entries with data = [4, 4]; cols = [1, 1]; rows = [1, 1] generate a sparse matrix whose value at (1, 1) is 8, while the desired value is 4.

>>> c = csc_matrix(([4, 4], ([1,1],[1,1])), shape=(3,3))
>>> c.todense()
matrix([[0, 0, 0],
        [0, 8, 0],
        [0, 0, 0]])

I'm also aware that I could filter them out with a 2-dimensional numpy unique function, but the lists are quite large, so this is not really a valid option.

Another possible answer to the question: is there any way of specifying what to do with duplicates, i.e. keeping the min or max instead of the default sum?

asked Feb 23 '15 by Imanol Luengo
2 Answers

Creating an intermediate dok matrix works in your example:

In [410]: c = sparse.coo_matrix((data, (cols, rows)), shape=(3,3)).todok().tocsc()

In [411]: c.A
Out[411]: 
array([[0, 0, 0],
       [0, 4, 0],
       [0, 0, 0]], dtype=int32)

A coo matrix puts your input arrays into its data, col and row attributes without change. The summing doesn't occur until it is converted to csc.
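
To see the two stages concretely (a minimal sketch reusing the toy data from the question):

from scipy import sparse

m = sparse.coo_matrix(([4, 4], ([1, 1], [1, 1])), shape=(3, 3))
print(m.data)           # [4 4] -- both duplicate entries are stored verbatim
print(m.tocsc()[1, 1])  # 8    -- the sum only appears on conversion to CSC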

todok loads the dictionary directly from the coo attributes. It creates a blank dok matrix and fills it with:

dok.update(izip(izip(self.row,self.col),self.data))

So if there are duplicate (row,col) values, it's the last one that remains. This uses the standard Python dictionary hashing to find the unique keys.
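
The exact todok internals have changed across SciPy versions (newer releases may sum the duplicates during the conversion), so it's worth verifying this on your version. A version-independent sketch is to fill the dok matrix by hand, since plain item assignment always overwrites:

from scipy.sparse import dok_matrix

d = dok_matrix((3, 3), dtype=int)
for r, cc, v in zip(rows, cols, data):
    d[r, cc] = v   # assignment replaces any earlier value at (r, cc)
c = d.tocsc()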


Here's a way of using np.unique. I had to construct a special object array, because unique operates on 1d arrays and we have 2d indices.

In [479]: data, cols, rows = [np.array(j) for j in [[1,4,2,4,1],[0,1,1,1,2],[0,1,2,1,1]]]

In [480]: x=np.zeros(cols.shape,dtype=object)

In [481]: x[:]=list(zip(rows,cols))

In [482]: x
Out[482]: array([(0, 0), (1, 1), (2, 1), (1, 1), (1, 2)], dtype=object)

In [483]: i=np.unique(x,return_index=True)[1]

In [484]: i
Out[484]: array([0, 1, 4, 2], dtype=int32)

In [485]: c1=sparse.csc_matrix((data[i],(cols[i],rows[i])),shape=(3,3))

In [486]: c1.A
Out[486]: 
array([[1, 0, 0],
       [0, 4, 2],
       [0, 1, 0]], dtype=int32)

I have no idea which approach is faster.


An alternative way of getting the unique index, as per liuengo's link:

rc = np.vstack([rows,cols]).T.copy()
dt = rc.dtype.descr * 2
i = np.unique(rc.view(dt), return_index=True)[1]

rc has to own its own data in order to change the dtype with view, hence the .T.copy().

In [554]: rc.view(dt)
Out[554]: 
array([[(0, 0)],
       [(1, 1)],
       [(2, 1)],
       [(1, 1)],
       [(1, 2)]], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])
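
On newer NumPy (1.13 and later), np.unique accepts an axis argument, which sidesteps both the object array and the view trick. A minimal sketch, assuming rows, cols and data are the arrays from In [479]:

import numpy as np
from scipy import sparse

rc = np.vstack([rows, cols]).T
i = np.unique(rc, axis=0, return_index=True)[1]  # first occurrence of each (row, col)
c1 = sparse.csc_matrix((data[i], (cols[i], rows[i])), shape=(3, 3))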
answered Nov 07 '22 by hpaulj


Since the values in your data at a repeating (row, col) pair are all the same, you can get the unique rows, columns and values as follows:

rows, cols, data = zip(*set(zip(rows, cols, data)))

Example (without removing the duplicates, the two entries at (1, 1) are summed):

data = [4, 3, 4]
cols = [1, 2, 1]
rows = [1, 3, 1]

csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()

matrix([[0, 0, 0, 0],
        [0, 8, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 3, 0]])



After removing the duplicates:

rows, cols, data = zip(*set(zip(rows, cols, data)))
csc_matrix((data, (rows, cols)), shape=(4, 4)).todense()

matrix([[0, 0, 0, 0],
        [0, 4, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 3, 0]])
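
If the duplicated values were not guaranteed to be identical, the same idea extends to the min/max variant asked about in the question: aggregate into a plain dict first and choose which value wins. A hedged sketch (starting again from the original data, cols and rows lists above; not part of the original answer):

from scipy.sparse import csc_matrix

agg = {}
for r, c, v in zip(rows, cols, data):
    agg[(r, c)] = min(agg.get((r, c), v), v)  # use max(...) to keep the maximum instead

rows2, cols2 = zip(*agg)  # iterating the dict yields its (row, col) keys
csc_matrix((list(agg.values()), (rows2, cols2)), shape=(4, 4)).todense()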
answered Nov 07 '22 by Andreas K.