Average duplicate values from two paired lists in Python using NumPy

In the past I have had to average two paired lists, and I used the answers provided there successfully.

However, with large lists (more than 20,000 items) the procedure is somewhat slow, and I was wondering whether using NumPy would make it faster.

I start from two lists, one of floats and one of strings:

names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

I'm trying to calculate the mean of the values that share the same name, so that after applying it, I'd get:

result_names = ["a", "b", "c", "d", "e"]
result_values = [1.2, 4.4, 2.0, 5.67, 8.54]

I showed two lists as the example result, but a list of (name, value) tuples would also suffice:

result = [("a", 1.2), ("b", 4.4), ("c", 2.0), ("d", 5.67), ("e", 8.54)]

What's the best way to do this with NumPy?

Einar asked Oct 17 '11 07:10



1 Answer

With NumPy you can write something yourself, or you can use existing groupby functionality (the rec_groupby function from matplotlib.mlab, although it is much slower; for more powerful groupby functionality, consider pandas). I compared both with Michael Dunn's dictionary-based answer:

import numpy as np
import random
from matplotlib.mlab import rec_groupby

listA = [random.choice("abcdef") for i in range(20000)]
listB = [20 * random.random() for i in range(20000)]

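names = np.array(listA)
values = np.array(listB)
# struct_array is used by the rec_groupby calls below but was never defined
# in the original post; this construction is an assumption.
struct_array = np.rec.fromarrays([names, values], names='names,values')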

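def f_dict(listA, listB):
    # Collect all values belonging to each name.
    d = {}

    for a, b in zip(listA, listB):
        d.setdefault(a, []).append(b)

    # Average each group, in the dictionary's key order.
    avg = []
    for key in d:
        avg.append(sum(d[key])/len(d[key]))

    # list() so that a plain list is returned on Python 3 as well.
    return list(d.keys()), avg

def f_numpy(names, values):
    # np.unique returns the distinct names in sorted order.
    result_names = np.unique(names)
    result_values = np.empty(result_names.shape)

    # One boolean mask per distinct name selects that group's values.
    for i, name in enumerate(result_names):
        result_values[i] = np.mean(values[names == name])

    return result_names, result_values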
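
The loop in f_numpy builds one boolean mask per distinct name. As a side note, a loop-free variant is possible with np.unique(..., return_inverse=True) and np.bincount. The following is only a sketch and was not part of the timings in this answer: f_bincount, q_names and q_values are hypothetical names introduced here, and the sample data is taken from the question.

```python
import numpy as np

def f_bincount(names, values):
    """Group means without a Python-level loop (hypothetical helper)."""
    names = np.asarray(names)
    values = np.asarray(values, dtype=float)
    # unique_names is sorted; inverse[i] is the position of names[i]
    # within unique_names, i.e. an integer group label per element.
    unique_names, inverse = np.unique(names, return_inverse=True)
    # Sum of the values in each group, and the size of each group.
    sums = np.bincount(inverse, weights=values)
    counts = np.bincount(inverse)
    return unique_names, sums / counts

# Sample data from the question.
q_names = ["a", "b", "b", "c", "d", "e", "e"]
q_values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
result_names, result_values = f_bincount(q_names, q_values)
```

Whether this beats f_numpy depends on the data: with only a handful of distinct names the loop is short anyway, so the gain should mainly show up when there are many distinct names.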

These are the results for the three approaches (f_dict, f_numpy and rec_groupby):

In [2]: f_dict(listA, listB)
Out[2]: 
(['a', 'c', 'b', 'e', 'd', 'f'],
 [9.9003182717213765,
  10.077784850173568,
  9.8623915728699636,
  9.9790599744319319,
  9.8811096512807097,
  10.118695410115953])

In [3]: f_numpy(names, values)
Out[3]: 
(array(['a', 'b', 'c', 'd', 'e', 'f'], 
      dtype='|S1'),
 array([  9.90031827,   9.86239157,  10.07778485,   9.88110965,
         9.97905997,  10.11869541]))

In [7]: rec_groupby(struct_array, ('names',), (('values', np.mean, 'resvalues'),))
Out[7]: 
rec.array([('a', 9.900318271721376), ('b', 9.862391572869964),
       ('c', 10.077784850173568), ('d', 9.88110965128071),
       ('e', 9.979059974431932), ('f', 10.118695410115953)], 
      dtype=[('names', '|S1'), ('resvalues', '<f8')])

In this test NumPy is roughly twice as fast as the dictionary approach (9.78 ms vs. 23 ms), while the predefined groupby function is much slower (203 ms):

In [32]: %timeit f_dict(listA, listB)
10 loops, best of 3: 23 ms per loop

In [33]: %timeit f_numpy(names, values)
100 loops, best of 3: 9.78 ms per loop

In [8]: %timeit rec_groupby(struct_array, ('names',), (('values', np.mean, 'values'),))
1 loops, best of 3: 203 ms per loop
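
For completeness, the pandas route mentioned at the top of this answer might look like the sketch below. It was not part of the timings above, and the column names "name" and "value" are arbitrary choices made for this illustration. As far as I know, rec_groupby is no longer shipped in recent matplotlib releases, so pandas is probably the more practical groupby option today.

```python
import pandas as pd

# Sample data from the question.
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

# Column names are arbitrary and only used for this illustration.
df = pd.DataFrame({"name": names, "value": values})

# groupby sorts the group keys by default; mean() averages each group.
means = df.groupby("name")["value"].mean()

# A list of (name, mean) tuples, as asked for in the question.
result = list(means.items())
```

The result is a Series indexed by name, so means.index and means.values give the two separate sequences if those are preferred over tuples.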
joris answered Oct 03 '22 08:10