Applying a function by bins on a vector in Numpy

Tags: python, numpy

How would I go about applying an aggregating function (such as "sum()" or "max()") to bins in a vector?

That is, if I have:

  1. a vector of values x of length N
  2. a vector of bin tags b of length N

such that b indicates which bin each value in x belongs to. For every distinct value in b, I want to apply the aggregating function "func()" to all the values of x that belong to that bin.

>> x = [1,2,3,4,5,6]
>> b = ["a","b","a","a","c","c"]    

The output should be two vectors (say the aggregating function is the product):

>>(labels, y) = apply_to_bins(values = x, bins = b, func = prod)

labels = ["a","b","c"]
y = [12, 2, 30]

I want to do this as elegantly as possible in NumPy (or just plain Python), since obviously I could just "for loop" over it.

eran asked Dec 11 '25 12:12


2 Answers

With pandas groupby this would be:

import numpy as np
import pandas as pd

def with_pandas_groupby(func, x, b):
    # Group the values of x by the bin tags in b, then aggregate each group.
    grouped = pd.Series(x).groupby(b)
    return grouped.agg(func)

Using the example of the OP:

>>> x = [1,2,3,4,5,6]
>>> b = ["a","b","a","a","c","c"]
>>> with_pandas_groupby(np.prod, x, b)
a    12
b     2
c    30
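
The question asks for two vectors, while groupby returns a single Series indexed by bin label. Splitting it could look like the sketch below. It assumes a pandas version that provides to_numpy() (0.24 or later), and it uses .prod() as the built-in equivalent of aggregating with np.prod.

```python
import pandas as pd

x = [1, 2, 3, 4, 5, 6]
b = ["a", "b", "a", "a", "c", "c"]

# One product per bin; the Series index holds the bin labels in sorted order.
result = pd.Series(x).groupby(b).prod()

labels = result.index.to_numpy()  # bin labels
y = result.to_numpy()             # aggregated value for each label

print(labels.tolist())  # ['a', 'b', 'c']
print(y.tolist())       # [12, 2, 30]
```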

I was curious about the speed, so I compared with_pandas_groupby with some of the functions given in senderle's answer.

  • apply_to_bins_groupby

     3 levels,      100 values: 175 us per loop
     3 levels,     1000 values: 1.16 ms per loop
     3 levels,  1000000 values: 1.21 s per loop
    
    10 levels,      100 values: 304 us per loop
    10 levels,     1000 values: 1.32 ms per loop
    10 levels,  1000000 values: 1.23 s per loop
    
    26 levels,      100 values: 554 us per loop
    26 levels,     1000 values: 1.59 ms per loop
    26 levels,  1000000 values: 1.27 s per loop
    
  • apply_to_bins3

     3 levels,      100 values: 136 us per loop
     3 levels,     1000 values: 259 us per loop
     3 levels,  1000000 values: 205 ms per loop
    
    10 levels,      100 values: 297 us per loop
    10 levels,     1000 values: 447 us per loop
    10 levels,  1000000 values: 262 ms per loop
    
    26 levels,      100 values: 617 us per loop
    26 levels,     1000 values: 795 us per loop
    26 levels,  1000000 values: 299 ms per loop
    
  • with_pandas_groupby

     3 levels,      100 values: 365 us per loop
     3 levels,     1000 values: 443 us per loop
     3 levels,  1000000 values: 89.4 ms per loop
    
    10 levels,      100 values: 369 us per loop
    10 levels,     1000 values: 453 us per loop
    10 levels,  1000000 values: 88.8 ms per loop
    
    26 levels,      100 values: 382 us per loop
    26 levels,     1000 values: 466 us per loop
    26 levels,  1000000 values: 89.9 ms per loop
    

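So pandas is the fastest for large inputs: at 1,000,000 values it takes about 89 ms, against 205 to 299 ms for apply_to_bins3 and about 1.2 s for apply_to_bins_groupby. At 100 values it is slower than apply_to_bins3 for 3 and 10 levels. Furthermore, the number of levels (bins) has little influence on the pandas timings. (Note that the timings start from NumPy arrays and include the time to create the pandas.Series.)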
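
The two functions compared against pandas come from senderle's answer and are not reproduced here. As a rough idea of what a NumPy-only grouping can look like, here is a minimal sketch. The name apply_to_bins_numpy and the sort-and-split strategy are assumptions for illustration, not the code that was timed above.

```python
import numpy as np

def apply_to_bins_numpy(values, bins, func):
    """Hypothetical helper: apply func to the values of each bin."""
    values = np.asarray(values)
    # labels: sorted distinct bin tags; inverse: label index of every element.
    labels, inverse = np.unique(bins, return_inverse=True)
    # A stable sort brings the elements of each bin together, keeping their order.
    order = np.argsort(inverse, kind="stable")
    sorted_values = values[order]
    # The cumulative bin sizes give the positions at which to cut the sorted values.
    counts = np.bincount(inverse, minlength=labels.size)
    boundaries = np.cumsum(counts)[:-1]
    groups = np.split(sorted_values, boundaries)
    return labels, np.array([func(g) for g in groups])

x = [1, 2, 3, 4, 5, 6]
b = ["a", "b", "a", "a", "c", "c"]
labels, y = apply_to_bins_numpy(x, b, np.prod)
print(labels.tolist())  # ['a', 'b', 'c']
print(y.tolist())       # [12, 2, 30]
```

There is still a Python-level loop over the bins, so the cost of this sketch grows with the number of levels; only the grouping itself is vectorized.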

I generated the data with:

def gen_data(nlevels, size):
    choices = 'abcdefghijklmnopqrstuvwxyz'
    # One single-letter label per level (bin).
    levels = np.asarray(list(choices[:nlevels]))
    # randint excludes the upper bound, so indices run from 0 to levels.size - 1.
    index = np.random.randint(0, levels.size, size)
    b = levels[index]
    x = np.arange(1, size + 1)
    return x, b

And then ran the benchmark in IPython like this:

In [174]: for nlevels in (3, 10, 26):
   .....:     for size in (100, 1000, 10**6):
   .....:         x, b = gen_data(nlevels, size)
   .....:         print('%2d levels, ' % nlevels, '%7d values:' % size, end=' ')
   .....:         %timeit function_to_time(np.prod, x, b)
   .....:     print()
bmu answered Dec 14 '25 02:12


import itertools as it
import operator as op

def apply_to_bins(values, bins, func):
    # Sort the (bin, value) pairs by bin so that groupby sees each bin as one run.
    pairs = sorted(zip(bins, values), key=op.itemgetter(0))
    # Aggregate the values of each run; func receives a generator of values.
    return {k: func(v for _, v in group)
            for k, group in it.groupby(pairs, key=op.itemgetter(0))}

x = [1,2,3,4,5,6]
b = ["a","b","a","a","c","c"]

print(apply_to_bins(x, b, sum))  # prints {'a': 8, 'b': 2, 'c': 11}
print(apply_to_bins(x, b, max))  # prints {'a': 4, 'b': 2, 'c': 6}
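
The function above returns a dict, while the question asks for two vectors. A minimal sketch of that variant follows, assuming Python 3.8+ for math.prod. The helper name apply_to_bins_pairs is hypothetical. It collects each bin's values into a list first, so func may also be a function that needs a real sequence (len, for example) rather than a generator.

```python
from collections import defaultdict
from math import prod

def apply_to_bins_pairs(values, bins, func):
    """Hypothetical variant returning (labels, results) instead of a dict."""
    groups = defaultdict(list)
    # Collect the values of each bin, preserving their original order.
    for tag, value in zip(bins, values):
        groups[tag].append(value)
    labels = sorted(groups)
    return labels, [func(groups[tag]) for tag in labels]

x = [1, 2, 3, 4, 5, 6]
b = ["a", "b", "a", "a", "c", "c"]

labels, y = apply_to_bins_pairs(x, b, prod)
print(labels)  # ['a', 'b', 'c']
print(y)       # [12, 2, 30]
```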
eumiro answered Dec 14 '25 03:12


