Average duplicate values from two paired lists in Python using NumPy

Tags: python, list, numpy, average

In the past I have found myself averaging two paired lists, and I have used the answers provided there successfully.

However, with large lists (more than 20,000 items) the procedure is somewhat slow, and I was wondering whether using NumPy would make it faster.

I start from two lists, one of floats and one of strings:

names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

I'm trying to calculate the mean of the values that share the same name, so that after applying it, I'd get:

result_names = ["a", "b", "c", "d", "e"]
result_values = [1.2, 4.4, 2.0, 5.67, 8.54]

I show two lists as the example result, but a list of (name, value) tuples would also suffice:

result = [("a", 1.2), ("b", 4.4), ("c", 2.0), ("d", 5.67), ("e", 8.54)]

What's the best way to do this with NumPy?


asked Oct 17 '11 07:10

Einar

1 Answer

With NumPy you can write something yourself, or you can use existing groupby functionality: the rec_groupby function from matplotlib.mlab (which is much slower) or, for more powerful groupby functionality, maybe pandas. I compared the first two options with Michael Dunn's dictionary-based answer:

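import numpy as np
import random
# Note: rec_groupby has been removed from recent Matplotlib releases,
# so this import needs an older version.
from matplotlib.mlab import rec_groupby

# Test data: 20,000 random names and values
listA = [random.choice("abcdef") for i in range(20000)]
listB = [20 * random.random() for i in range(20000)]

names = np.array(listA)
values = np.array(listB)

# Record array passed to rec_groupby below. Its construction is not shown
# in the answer, so this definition is an assumption.
struct_array = np.rec.fromarrays([names, values], names='names,values')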

def f_dict(listA, listB):
    # Group the values by name in a dictionary of lists
    d = {}

    for a, b in zip(listA, listB):
        d.setdefault(a, []).append(b)

    # Average each group
    avg = []
    for key in d:
        avg.append(sum(d[key])/len(d[key]))

    return list(d), avg

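def f_numpy(names, values):
    # np.unique returns the sorted unique names
    result_names = np.unique(names)
    result_values = np.empty(result_names.shape)

    # For each name, average the values selected by a boolean mask
    for i, name in enumerate(result_names):
        result_values[i] = np.mean(values[names == name])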

    return result_names, result_values
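
The loop in f_numpy scans the whole array once per unique name. As a rough sketch of a fully vectorized variant (not part of the comparison in this answer, and not timed here), np.unique with return_inverse=True can label each element with its group index, and np.bincount can then sum and count per group. The name f_bincount is hypothetical, and the data is the small example from the question:

```python
import numpy as np

def f_bincount(names, values):
    # Hypothetical helper, not from the original answer.
    names = np.asarray(names)
    values = np.asarray(values, dtype=float)

    # Sorted unique names, plus the group index of every element
    result_names, inverse = np.unique(names, return_inverse=True)

    # Per-group sum of values and per-group element count
    sums = np.bincount(inverse, weights=values, minlength=len(result_names))
    counts = np.bincount(inverse, minlength=len(result_names))

    return result_names, sums / counts

names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

result_names, result_values = f_bincount(names, values)

# Optional: the list of (name, value) tuples asked for in the question
result = list(zip(result_names.tolist(), result_values.tolist()))
```

Like f_numpy, this returns the names in sorted order rather than in order of first appearance.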

These are the results of f_dict, f_numpy and rec_groupby on the test data:

In [2]: f_dict(listA, listB)
Out[2]: 
(['a', 'c', 'b', 'e', 'd', 'f'],
 [9.9003182717213765,
  10.077784850173568,
  9.8623915728699636,
  9.9790599744319319,
  9.8811096512807097,
  10.118695410115953])

In [3]: f_numpy(names, values)
Out[3]: 
(array(['a', 'b', 'c', 'd', 'e', 'f'], 
      dtype='|S1'),
 array([  9.90031827,   9.86239157,  10.07778485,   9.88110965,
         9.97905997,  10.11869541]))

In [7]: rec_groupby(struct_array, ('names',), (('values', np.mean, 'resvalues'),))
Out[7]: 
rec.array([('a', 9.900318271721376), ('b', 9.862391572869964),
       ('c', 10.077784850173568), ('d', 9.88110965128071),
       ('e', 9.979059974431932), ('f', 10.118695410115953)], 
      dtype=[('names', '|S1'), ('resvalues', '<f8')])

In this test NumPy is a bit more than twice as fast as the dictionary approach (9.78 ms versus 23 ms), and the predefined groupby function is much slower (203 ms):

In [32]: %timeit f_dict(listA, listB)
10 loops, best of 3: 23 ms per loop

In [33]: %timeit f_numpy(names, values)
100 loops, best of 3: 9.78 ms per loop

In [8]: %timeit rec_groupby(struct_array, ('names',), (('values', np.mean, 'values'),))
1 loops, best of 3: 203 ms per loop
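
For the pandas route mentioned at the top, a minimal sketch could look like the following. It assumes pandas is installed, the column names name and value are only illustrative, and it was not part of the timings above:

```python
import pandas as pd

names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

# Put the paired lists into a two-column DataFrame
df = pd.DataFrame({"name": names, "value": values})

# Group rows by name and average the value column of each group
means = df.groupby("name")["value"].mean()

# Convert to a list of (name, mean) tuples, sorted by name
result = list(means.items())
```

The result is a Series indexed by name, so means["b"] looks up a single group's average directly.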


answered Oct 03 '22 08:10

joris
