 

Using numpy.bincount with array weights

Tags:

python

numpy

I would like to use bincount to sum arrays; however, it only supports one-dimensional weights. For example, this works:

np.bincount([1, 1, 0],weights=np.array([1, 2, 4]))
Out: array([ 4.,  3.])

However I would like to use a dimension 2 array as:

np.bincount([1, 1, 0],weights=np.array([[1,1], [2,2], [4,4]]))
ValueError: object too deep for desired array

The desired output is:

Out: array([[ 4.,  4.],[3., 3.]])

Better explanation after comments:

I want to sum each row of the array into the corresponding index.

With a loop it would be:

Bin = np.zeros((2, 2))
for j, i in enumerate([1, 1, 0]):
    Bin[i] += a[j]

where a is the previous 3x2 matrix. Is there an efficient way to obtain this result?
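Concretely, the intended accumulation can be written as a loop (a sketch using the example values from the question):

```python
import numpy as np

a = np.array([[1, 1], [2, 2], [4, 4]])  # the 3x2 matrix from the question
ii = [1, 1, 0]                          # target bin for each row of a

Bin = np.zeros((2, 2))
for j, i in enumerate(ii):
    Bin[i] += a[j]   # row j of a is accumulated into row i of Bin

# Bin is now [[4., 4.], [3., 3.]]
```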

Asked Mar 05 '11 by Andrea Zonca


3 Answers

As per the numpy documentation:

numpy.bincount(x, weights=None, minlength=None)

weights : array_like, optional; Weights, array of the same shape as x.

So you can't use bincount directly in this fashion unless you alter x somehow.

Edit: So I came up with a slightly tricky way of doing this, but no guarantees about the performance when you go to large arrays. Basically I'm going to leverage how scipy sparse matrices handle repeated entries at the same indices (they sum them):

import numpy as np
from scipy.sparse import coo_matrix

a = np.array([[1, 1], [2, 2], [4, 4]])
ii = np.array([1, 1, 0])

ares = a.reshape((-1,), order='F')
# ares == array([1, 2, 4, 1, 2, 4])

col = np.tile(ii, (a.shape[1],))
# col == array([1, 1, 0, 1, 1, 0])

row = np.tile([0, 1], (a.shape[0], 1)).reshape((-1,), order='F')
# row == array([0, 0, 0, 1, 1, 1])

g = coo_matrix((ares, (col, row)), shape=(2, 2))
print(g.todense())

Now you're going to have to generalize this to your precise data. The basic idea is that you want to map each data point to the correct element of your results array and then let the sparse array handle summing the duplicate entries.

Otherwise, I would look into using Cython if you are forced to use looping to solve this.
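(As an aside, not in the original answer: newer NumPy versions provide np.add.at, which performs unbuffered in-place accumulation and handles repeated indices correctly, avoiding both Cython and sparse matrices. A minimal sketch with the question's values:)

```python
import numpy as np

a = np.array([[1, 1], [2, 2], [4, 4]])
ii = np.array([1, 1, 0])

# out[ii] += a, with repeated indices accumulated rather than overwritten
out = np.zeros((ii.max() + 1, a.shape[1]))
np.add.at(out, ii, a)
# out == [[4., 4.], [3., 3.]]
```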

Edit 2: For kicks, I timed two different methods:

import numpy as np
from scipy.sparse import coo_matrix

def method1():
    return np.array([np.bincount(ii, r) for r in a.T]).T

def method2():
    ares = a.reshape((-1,),order='F')
    col = np.tile(ii,(a.shape[1],))
    row = np.tile(np.arange(a.shape[1]),(a.shape[0],1)).reshape((-1,),order='F') 

    return coo_matrix((ares,(col,row)),shape=(np.unique(ii).size,a.shape[1])).todense()

if __name__ == '__main__':
    from timeit import Timer

    a = np.random.randint(0,1000,(1000000,3))
    ii = np.random.randint(0,10,(a.shape[0],))

    N = 100
    t1 = Timer("method1()", "from __main__ import method1")
    t2 = Timer("method2()", "from __main__ import method2")
    print('t2/t1: %f' % (t2.timeit(N) / t1.timeit(N)))

On my machine, method2 is about 3-5x slower than method1 depending on the shape of the inputs, so looping isn't necessarily a bad option.

Answered Nov 08 '22 by JoshAdel


You should use a scipy CSR matrix to represent the indices, then take a dot product with your data. On my laptop this is 14x faster than @JoshAdel's method1 and 54x faster than his method2 for big matrices.

import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def method1():
    return np.array([np.bincount(ii, r) for r in a.T]).T

def method2():
    ares = a.reshape((-1,), order='F')
    col = np.tile(ii, (a.shape[1],))
    row = np.tile(np.arange(a.shape[1]), (a.shape[0], 1)).reshape((-1,), order='F')

    return coo_matrix((ares, (col, row)), shape=(ii.max() + 1, a.shape[1])).todense()

def method3():
    csr = csr_matrix((np.ones(ii.shape[0]), (ii, np.arange(ii.shape[0]))))
    return csr * a
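To see what method3 does, here it is on the small example from the question (a sketch; the indicator-matrix construction is the same as above):

```python
import numpy as np
from scipy.sparse import csr_matrix

a = np.array([[1, 1], [2, 2], [4, 4]])
ii = np.array([1, 1, 0])

# Build an (n_bins x n_rows) indicator matrix with entry (ii[k], k) = 1,
# so multiplying by a sums the rows of a that share a bin index.
csr = csr_matrix((np.ones(ii.shape[0]), (ii, np.arange(ii.shape[0]))))
result = csr * a
# result == [[4., 4.], [3., 3.]]
```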

Let's generate random data and time them:

n = 1 << 18
d = 512
ii = np.random.randint(low=1, high=1 << 10, size=n)
a = np.random.randn(n, d)

%timeit method1()
# 1 loop, best of 3: 3.13 s per loop

%timeit method2()
# 1 loop, best of 3: 11.7 s per loop

%timeit method3()
# 1 loop, best of 3: 216 ms per loop

# sanity checks:
assert (method1() == method2()).all()
assert (method1() == method3()).all()

Note: I replaced np.unique(ii).size in method2 with ii.max() + 1, so both methods produce matrices with a row for every possible index.

Answered Nov 08 '22 by M1L0U


I think this is a simpler version of what you asked, based on bincount. In essence, you split the weights by column so that each call matches the shape of the index array, then stack the resulting columns together.

a = np.array([1, 1, 0])
b = np.array([[1,1], [2,2], [4,4]])
uni = np.unique(a)

a_x = np.bincount(a,  weights=b[:,0], minlength=len(uni))
a_y = np.bincount(a,  weights=b[:,1], minlength=len(uni))

final = np.column_stack((a_x, a_y))
# final == np.array([[4., 4.], [3., 3.]])
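For weight arrays with more than two columns, the same per-column idea generalizes with a comprehension (a sketch; `b` may have any number of columns):

```python
import numpy as np

a = np.array([1, 1, 0])
b = np.array([[1, 1], [2, 2], [4, 4]])

# One bincount per column of b, then stack the results side by side.
final = np.column_stack([
    np.bincount(a, weights=b[:, k], minlength=a.max() + 1)
    for k in range(b.shape[1])
])
# final == [[4., 4.], [3., 3.]]
```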
Answered Nov 08 '22 by MariosN3