How would I go about applying an aggregating function (such as sum() or max()) to bins in a vector?
That is, if I have:

>> x = [1,2,3,4,5,6]
>> b = ["a","b","a","a","c","c"]

such that b indicates to which bin each value in x belongs, then for every possible value in b I want to apply the aggregating function func() to all the values of x that belong to that bin.
The output should be two vectors (say the aggregating function is the product function):

>> (labels, y) = apply_to_bins(values = x, bins = b, func = prod)
labels = ["a","b","c"]
y = [12, 2, 30]
I want to do this as elegantly as possible in numpy (or just Python), since obviously I could just for-loop over it.
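A minimal sketch of that interface, assuming plain numpy: np.unique(..., return_inverse=True) gives both the sorted labels and the bin index of every element, so only the per-bin reduction needs a Python-level loop.

import numpy as np

def apply_to_bins(values, bins, func):
    values = np.asarray(values)
    # labels are the sorted unique bin names; inverse maps each element to its label's position
    labels, inverse = np.unique(bins, return_inverse=True)
    # reduce the values falling into each bin with func
    y = np.array([func(values[inverse == i]) for i in range(len(labels))])
    return labels, y

labels, y = apply_to_bins([1,2,3,4,5,6], ["a","b","a","a","c","c"], np.prod)
# labels -> ['a' 'b' 'c'],  y -> [12  2 30]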
With pandas groupby this would be:

import pandas as pd

def with_pandas_groupby(func, x, b):
    grouped = pd.Series(x).groupby(b)
    return grouped.agg(func)
Using the example of the OP:
>>> x = [1,2,3,4,5,6]
>>> b = ["a","b","a","a","c","c"]
>>> with_pandas_groupby(np.prod, x, b)
a 12
b 2
c 30
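If you want the two separate vectors from the question rather than a Series, they can be pulled out of the result (a small addition for completeness, not part of the benchmarked code):

result = with_pandas_groupby(np.prod, x, b)
labels, y = result.index.values, result.values
# labels -> array(['a', 'b', 'c'], dtype=object),  y -> array([12,  2, 30])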
I was just interested in the speed, so I compared with_pandas_groupby with some of the functions given in senderle's answer.
apply_to_bins_groupby
3 levels, 100 values: 175 us per loop
3 levels, 1000 values: 1.16 ms per loop
3 levels, 1000000 values: 1.21 s per loop
10 levels, 100 values: 304 us per loop
10 levels, 1000 values: 1.32 ms per loop
10 levels, 1000000 values: 1.23 s per loop
26 levels, 100 values: 554 us per loop
26 levels, 1000 values: 1.59 ms per loop
26 levels, 1000000 values: 1.27 s per loop
apply_to_bins3
3 levels, 100 values: 136 us per loop
3 levels, 1000 values: 259 us per loop
3 levels, 1000000 values: 205 ms per loop
10 levels, 100 values: 297 us per loop
10 levels, 1000 values: 447 us per loop
10 levels, 1000000 values: 262 ms per loop
26 levels, 100 values: 617 us per loop
26 levels, 1000 values: 795 us per loop
26 levels, 1000000 values: 299 ms per loop
with_pandas_groupby
3 levels, 100 values: 365 us per loop
3 levels, 1000 values: 443 us per loop
3 levels, 1000000 values: 89.4 ms per loop
10 levels, 100 values: 369 us per loop
10 levels, 1000 values: 453 us per loop
10 levels, 1000000 values: 88.8 ms per loop
26 levels, 100 values: 382 us per loop
26 levels, 1000 values: 466 us per loop
26 levels, 1000000 values: 89.9 ms per loop
So pandas is the fastest for large numbers of values. Furthermore, the number of levels (bins) has no big influence on the computation time.
(Note that the times are measured starting from numpy arrays, so the time to create the pandas.Series is included.)
I generated the data with:
import numpy as np

def gen_data(nlevels, size):
    choices = 'abcdefghijklmnopqrstuvwxyz'
    levels = np.asarray([l for l in choices[:nlevels]])
    index = np.random.random_integers(0, levels.size - 1, size)
    b = levels[index]
    x = np.arange(1, size + 1)
    return x, b
And then run the benchmark in ipython like this:

In [174]: for nlevels in (3, 10, 26):
   .....:     for size in (100, 1000, 10e5):
   .....:         x, b = gen_data(nlevels, size)
   .....:         print '%2d levels, ' % nlevels, '%7d values:' % size,
   .....:         %timeit function_to_time(np.prod, x, b)
   .....:     print
A pure-Python alternative uses itertools.groupby, after sorting the (bin, value) pairs by bin:

import itertools as it
import operator as op

def apply_to_bins(values, bins, func):
    pairs = sorted(zip(bins, values), key=op.itemgetter(0))
    return {k: func(x[1] for x in v)
            for k, v in it.groupby(pairs, key=op.itemgetter(0))}

x = [1,2,3,4,5,6]
b = ["a","b","a","a","c","c"]

print apply_to_bins(x, b, sum)  # returns {'a': 8, 'b': 2, 'c': 11}
print apply_to_bins(x, b, max)  # returns {'a': 4, 'b': 2, 'c': 6}
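To get the (labels, y) pair the question asks for out of this dict, the sorted items can be unzipped (a small extra step on top of the answer above):

labels, y = zip(*sorted(apply_to_bins(x, b, sum).items()))
# labels -> ('a', 'b', 'c'),  y -> (8, 2, 11)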