I would like to write a clean function that aggregates data in an array (it is a NumPy record array, but that does not change anything).
Say you have an array of data that you want to aggregate along one axis, for example an array with dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)],
and you want the mean income per job.
I wrote the function below; for this example it would be called as aggregate(data, 'job', 'income', mean).
def aggregate(data, key, value, func):
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        if k not in data_per_key:
            data_per_key[k] = []
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key]
The problem is that I do not find it very elegant, and I would like to have it in one line. Do you have any ideas?
Thanks for your answers, Louis
PS: I would like to keep func in the call so that you can also ask for the median, the minimum, and so on.
Appending to NumPy arrays is very inefficient, because memory for the entire array has to be found and assigned again at every single step. Depending on the application, there are much better strategies. If you know the length in advance, it is best to pre-allocate the array. Appending to a list is faster than appending to an array.
In general it is better and faster to iterate or append with lists, and to apply np.array (or np.concatenate) just once at the end. Appending to a list is fast, much faster than making a new array.
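The three strategies mentioned above can be compared in a small sketch. The function names are hypothetical and chosen only for illustration, and np.empty is just one possible way to pre-allocate; all three functions build the same array of squares.

```python
import numpy as np

def squares_via_list(n):
    # Append to a plain Python list, then convert to an array once at the end.
    values = []
    for i in range(n):
        values.append(i * i)
    return np.array(values)

def squares_preallocated(n):
    # The length is known in advance, so allocate once and fill in place.
    out = np.empty(n, dtype=np.int64)
    for i in range(n):
        out[i] = i * i
    return out

def squares_via_np_append(n):
    # Shown only for contrast: np.append returns a new array on every call,
    # so the existing data is copied at each iteration.
    out = np.array([], dtype=np.int64)
    for i in range(n):
        out = np.append(out, i * i)
    return out
```

For small inputs the difference hardly matters; the repeated copying in the last version is what becomes costly as n grows.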
Perhaps the function you are seeking is matplotlib.mlab.rec_groupby:
import numpy as np
import matplotlib.mlab
data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])
result = matplotlib.mlab.rec_groupby(data, ('job',), (('income', np.mean, 'avg_income'),))
yields
('Digger', 4.0)
('Planter', 2.5)
('Waterer', 3.0)
matplotlib.mlab.rec_groupby returns a recarray:
print(result.dtype)
# [('job', '|S7'), ('avg_income', '<f8')]
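As far as I know, rec_groupby has been removed from more recent Matplotlib releases, so it may not be available in your installation. In that case, a NumPy-only version of aggregate that keeps func in the call could look like the sketch below. It is not the method used above, just one possible alternative based on np.unique, and it assumes the key column is one-dimensional.

```python
import numpy as np

def aggregate(data, key, value, func):
    # np.unique returns the sorted distinct keys, plus for every row the
    # index of its key in that sorted list.
    keys, inverse = np.unique(data[key], return_inverse=True)
    # Apply func to the values of each group, selected with a boolean mask.
    return [(str(k), func(data[value][inverse == i])) for i, k in enumerate(keys)]

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

result = aggregate(data, 'job', 'income', np.mean)
for job, avg_income in result:
    print(job, avg_income)
```

Any reducing function can be passed instead of np.mean, for example np.median or np.min. The mask is recomputed for every group, which is fine for a handful of groups but not the fastest option for many.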
You may also be interested in checking out pandas, which has even more versatile facilities for handling group-by operations.
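For comparison, here is a minimal sketch of the same aggregation in pandas, assuming pandas is installed. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

# Each field of the structured array becomes a DataFrame column.
df = pd.DataFrame.from_records(data)

# Group the rows by job and look only at the income column.
income_per_job = df.groupby('job')['income']

avg_income = income_per_job.mean()
# agg takes the name of the reduction, so the function stays a parameter.
median_income = income_per_job.agg('median')
min_income = income_per_job.agg('min')
```

Each result is a Series indexed by job, so avg_income['Planter'] gives the mean income of the planters.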