
Fastest way to extract dictionary of sums in numpy in 1 I/O pass

Let's say I have an array like:

arr = np.array([[1,20,5],
                [1,20,8],
                [3,10,4],
                [2,30,6],
                [3,10,5]])

and I would like a dictionary mapping each distinct value in the first column to the sum of the third column over the rows carrying that value, i.e. return {1: 13, 2: 6, 3: 9}. To make matters more challenging, there are 1 billion rows in my array and 100k unique elements in the first column.

Approach 1: Naively, I can call np.unique(), then iterate over each unique value and combine np.where() with np.sum() in a one-line dictionary comprehension. This would be reasonably fast with a small number of unique elements, but at 100k unique elements I would incur a lot of wasted page fetches, making 100k I/O passes over the entire array.
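A minimal sketch of Approach 1 (using arr from above; the names are just illustrative):

uniq = np.unique(arr[:, 0])
# one np.where/np.sum scan of the whole array per unique key -- ~100k full passes at scale
sums = {k: arr[np.where(arr[:, 0] == k)[0], 2].sum() for k in uniq}
# {1: 13, 2: 6, 3: 9}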

Approach 2: I could make a single I/O pass over the array, adding each row's third-column value to a per-key running sum as I go (hashing column 1 at each row will probably be cheaper than the excessive page fetches), but I lose the advantage of numpy's C inner-loop vectorization here.
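For reference, a minimal sketch of Approach 2 as the pure Python loop I'd like to avoid:

from collections import defaultdict

sums = defaultdict(int)
for key, _, val in arr:   # single pass, but the loop body runs in interpreted Python, not C
    sums[key] += val
# {1: 13, 3: 9, 2: 6}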

Is there a fast way to implement Approach 2 without resorting to a pure Python loop?

asked Jul 21 '16 by Katie

4 Answers

numpy approach:

u = np.unique(arr[:, 0])                        # unique keys in the first column
s = ((arr[:, [0]] == u) * arr[:, [2]]).sum(0)   # (n_rows, n_unique) mask weighted by the third column, summed per key

dict(np.stack([u, s]).T)

{1: 13, 2: 6, 3: 9}

pandas approach:

import pandas as pd
import numpy as np

pd.DataFrame(arr, columns=list('ABC')).groupby('A').C.sum().to_dict()

{1: 13, 2: 6, 3: 9}


answered Oct 20 '22 by piRSquared


Here's a NumPy-based approach using np.add.reduceat -

sidx = arr[:,0].argsort()                                   # row order that sorts by the key column
idx = np.append(0,np.where(np.diff(arr[sidx,0])!=0)[0]+1)   # start offset of each run of equal keys
keys = arr[sidx[idx],0]                                     # one key per run
vals = np.add.reduceat(arr[sidx,2],idx,axis=0)              # sum the third column within each run

If you would like to get the keys and values in a 2-column array -

out = np.column_stack((keys,vals))

Sample run -

In [351]: arr
Out[351]: 
array([[ 1, 20,  5],
       [ 1, 20,  8],
       [ 3, 10,  4],
       [ 2, 30,  6],
       [ 3, 10,  5]])

In [352]: out
Out[352]: 
array([[ 1, 13],
       [ 2,  6],
       [ 3,  9]])
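
If the dictionary asked for in the question is the desired output, keys and vals can be zipped straight into one:

out_dict = dict(zip(keys, vals))
# {1: 13, 2: 6, 3: 9}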
answered Oct 20 '22 by Divakar


This is a typical grouping problem, which the numpy_indexed package solves efficiently and elegantly (if I may say so myself; I am its author):

import numpy_indexed as npi
npi.group_by(arr[:, 0]).sum(arr[:, 2])

It's a much more lightweight solution than the pandas package, and I think the syntax is cleaner, as there is no need to create a special data structure just to perform this type of elementary operation. Performance should be identical to the solution posted by Divakar, since it follows the same steps; just with a nice and tested interface on top.
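As a small usage sketch for the dictionary output specifically, group_by(arr[:, 0]).sum(arr[:, 2]) hands back the unique keys alongside the per-group sums, so the result is one zip away:

import numpy_indexed as npi

unique_keys, group_sums = npi.group_by(arr[:, 0]).sum(arr[:, 2])   # assumes the (unique_keys, per_group_sums) return pair
result = dict(zip(unique_keys, group_sums))
# {1: 13, 2: 6, 3: 9}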

answered Oct 20 '22 by Eelco Hoogendoorn


The proper way of doing this with NumPy is to use np.bincount. If your unique first-column labels are already small contiguous integers, you could simply do:

cum_sums = np.bincount(arr[:, 0], weights=arr[:, 2])   # slot i holds the sum of the third column over rows labelled i
cum_dict = {index: cum_sum for index, cum_sum in enumerate(cum_sums)
            if cum_sum != 0}

where the cum_sum != 0 check is an attempt to skip missing first-column labels; this can be grossly wrong if your third column includes negative numbers, since a label that is actually present can legitimately sum to zero and would be dropped.
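A sketch of a more robust skip, still assuming small non-negative integer labels: test per-label row counts instead of the sums, so a group that sums to zero is kept.

counts = np.bincount(arr[:, 0])                      # rows per label; 0 means the label never occurs
sums = np.bincount(arr[:, 0], weights=arr[:, 2])     # per-label sum of the third column
cum_dict = {label: total for label, (n, total) in enumerate(zip(counts, sums)) if n > 0}
# {1: 13.0, 2: 6.0, 3: 9.0}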

Alternatively you can do things properly and call np.unique first and do:

uniques, indices = np.unique(arr[:, 0], return_inverse=True)   # dense 0..k-1 codes for the labels
cum_sums = np.bincount(indices, weights=arr[:, 2])              # per-label sums, one slot per unique label
cum_dict = {index: cum_sum for index, cum_sum in zip(uniques, cum_sums)}
answered Oct 20 '22 by Jaime