In the past I have found myself dealing with averaging two paired lists, and the answers provided to that question worked well for me.
However, with large lists (more than 20,000 items) the procedure is somewhat slow, and I was wondering whether using NumPy would make it faster.
I start from two lists, one of floats and one of strings:
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
I'm trying to calculate the mean of the identical values, so that after applying it, I'd get:
result_names = ["a", "b", "c", "d", "e"]
result_values = [1.2, 4.4, 2.0, 5.67, 8.54]
I put two lists as a result example, but having a list of (name, value) tuples would also suffice:
result = [("a", 1.2), ("b", 4.4), ("c", 2.0), ("d", 5.67), ("e", 8.54)]
What's the best way to do this with NumPy?
Averaging NumPy arrays is quite similar to averaging plain numbers: to average several arrays elementwise, you sum the corresponding elements and divide by the number of arrays. For a single array, the arithmetic mean is the sum of the elements along an axis divided by the number of elements, and numpy.mean() returns exactly that.
np.mean always computes an arithmetic mean and has some additional options for input and output (e.g. which dtype to use and where to place the result), while np.average can compute a weighted average if the weights parameter is supplied.
To calculate the average separately for each column of a 2D array, set the axis argument to 0: np.average(matrix, axis=0).
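A minimal sketch of these calls (the matrix here is purely illustrative, not taken from the question):

import numpy as np

matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

print(np.mean(matrix))                         # 2.5 -- mean over all elements
print(np.average(matrix, axis=0))              # [2. 3.] -- column-wise means
print(np.average([1.0, 2.0], weights=[3, 1]))  # 1.25 -- weighted mean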
With numpy you can write something yourself, or you can use groupby functionality via the rec_groupby function from matplotlib.mlab (which turns out to be much slower; for more powerful groupby functionality, look at pandas). I compared both with the dictionary approach from Michael Dunn's answer:
import numpy as np
import random
from matplotlib.mlab import rec_groupby
listA = [random.choice("abcdef") for i in range(20000)]
listB = [20 * random.random() for i in range(20000)]
names = np.array(listA)
values = np.array(listB)
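The rec_groupby call below uses a struct_array that the snippet never defines; one plausible way to build it (an assumption on my part, not part of the original answer) is:

# Hypothetical construction of the record array that rec_groupby expects;
# struct_array is not defined anywhere in the original answer.
struct_array = np.rec.fromarrays([names, values], names='names,values')

Note that recent matplotlib releases no longer ship rec_groupby, so this part of the comparison only runs on old versions.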
def f_dict(listA, listB):
    # Collect all values belonging to each name...
    d = {}
    for a, b in zip(listA, listB):
        d.setdefault(a, []).append(b)
    # ...then average each list of values.
    avg = []
    for key in d:
        avg.append(sum(d[key]) / len(d[key]))
    return d.keys(), avg
def f_numpy(names, values):
    result_names = np.unique(names)
    result_values = np.empty(result_names.shape)
    # One boolean-mask selection over values per unique name.
    for i, name in enumerate(result_names):
        result_values[i] = np.mean(values[names == name])
    return result_names, result_values
These are the results for the three approaches:
In [2]: f_dict(listA, listB)
Out[2]:
(['a', 'c', 'b', 'e', 'd', 'f'],
[9.9003182717213765,
10.077784850173568,
9.8623915728699636,
9.9790599744319319,
9.8811096512807097,
10.118695410115953])
In [3]: f_numpy(names, values)
Out[3]:
(array(['a', 'b', 'c', 'd', 'e', 'f'],
dtype='|S1'),
array([ 9.90031827, 9.86239157, 10.07778485, 9.88110965,
9.97905997, 10.11869541]))
In [7]: rec_groupby(struct_array, ('names',), (('values', np.mean, 'resvalues'),))
Out[7]:
rec.array([('a', 9.900318271721376), ('b', 9.862391572869964),
('c', 10.077784850173568), ('d', 9.88110965128071),
('e', 9.979059974431932), ('f', 10.118695410115953)],
dtype=[('names', '|S1'), ('resvalues', '<f8')])
It seems that numpy is a little faster for this test (and the pre-defined groupby function much slower):
In [32]: %timeit f_dict(listA, listB)
10 loops, best of 3: 23 ms per loop
In [33]: %timeit f_numpy(names, values)
100 loops, best of 3: 9.78 ms per loop
In [8]: %timeit rec_groupby(struct_array, ('names',), (('values', np.mean, 'values'),))
1 loops, best of 3: 203 ms per loop
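If the Python-level loop over unique names in f_numpy ever becomes the bottleneck (e.g. with many distinct names rather than just six), a fully vectorized variant is possible. The following sketch is my own addition, not part of the answer above; it relies on np.unique with return_inverse=True and np.bincount:

def f_bincount(names, values):
    # Map each name to an integer group id; result_names comes back sorted.
    result_names, inverse = np.unique(names, return_inverse=True)
    # Per-group sums and counts in a single pass each, no Python loop.
    sums = np.bincount(inverse, weights=values)
    counts = np.bincount(inverse)
    return result_names, sums / counts

f_bincount(names, values) returns the same sorted names and means as f_numpy, but does all the grouping work inside NumPy.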