Python: creating a new list using a "template list"

Tags: python, numpy

Suppose I have:

x1 = [1, 3, 2, 4]

and:

x2 = [0, 1, 1, 0]

Both have the same shape.

Now I want to "put x2 on top of x1" and sum up all the numbers of x1 that correspond to the same number in x2, so the end result is:

end = [1+4 ,3+2]  # end[0] is the sum of all numbers of x1 where a 0 was in x2

This is a naive implementation using lists to further clarify the question:

store_0 = 0
store_1 = 0
x1 = [1, 3, 2, 4]
x2 = [0, 1, 1, 0]
for value_x1, value_x2 in zip(x1, x2):
    if value_x2 == 0:
        store_0 += value_x1
    elif value_x2 == 1:
        store_1 += value_x1

So my question: is there a way to implement this in numpy without using loops, or just faster in general?

asked Apr 26 '21 18:04

user15770670

3 Answers

In this particular example (and, in general, for unique, duplicated, and groupby kinds of operations), pandas is faster than a pure numpy solution:

A pandas way, using Series (credit: very similar to @mcsoini's answer):

def pd_group_sum(x1, x2):
    return pd.Series(x1, index=x2).groupby(x2).sum()

A pure numpy way, using np.unique and a broadcast comparison:

def np_group_sum(a, groups):
    _, ix, rix = np.unique(groups, return_index=True, return_inverse=True)
    return np.where(np.arange(len(ix))[:, None] == rix, a, 0).sum(axis=1)
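To see what the broadcast comparison in np_group_sum does, here is a small worked sketch on the example arrays used in this answer. The names labels and mask are illustrative and do not appear in the answer itself.

```python
import numpy as np

a = np.array([1, 3, 2, 4, 8])
groups = np.array([0, 1, 1, 0, 7])

# rix[i] is the position of groups[i] within the sorted unique labels.
labels, rix = np.unique(groups, return_inverse=True)

# One row per label, one column per element: mask[k, i] is True
# when element i belongs to label k. Shape is (n_labels, n).
mask = np.arange(len(labels))[:, None] == rix

# Keep a[i] where the mask is set, 0 elsewhere, then sum each row.
sums = np.where(mask, a, 0).sum(axis=1)
```

The intermediate mask holds n_labels × n entries, which is probably a large part of why this variant is the slowest one in the timings further down.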

Note: a faster pure numpy way (see the timings below), inspired by @Woodford's answer:

def selsum(a, g, e):
    return a[g==e].sum()

vselsum = np.vectorize(selsum, signature='(n),(n),()->()')

def np_group_sum2(a, groups):
    return vselsum(a, groups, np.unique(groups))

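Yet another pure numpy way is inspired by a comment from @mapf about using argsort(). On the benchmark below, argsort() by itself already takes 45 ms, so we may try something based on np.argpartition(x2, len(x2)-1) instead, which takes only 7.5 ms by itself. Caveat: with a single kth, np.argpartition only guarantees that the element at position kth is where a full sort would put it, with no larger element before it. It does not promise that equal labels end up adjacent, so verify the output against np_group_sum on your own data: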

def np_group_sum3(a, groups):
    ix = np.argpartition(groups, len(groups)-1)
    ends = np.nonzero(np.diff(np.r_[groups[ix], groups.max() + 1]))[0]
    return np.diff(np.r_[0, a[ix].cumsum()[ends]])
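If you would rather not depend on how argpartition orders the remaining elements, the same cumsum idea can be sketched on top of a full np.argsort. The name np_group_sum_sorted is hypothetical, the function assumes non-empty 1-D inputs, and it is not part of the benchmark in this answer. Since argsort alone was measured at 45 ms, expect it to be slower than np_group_sum3.

```python
import numpy as np

def np_group_sum_sorted(a, groups):
    """Grouped sum via a full sort; assumes non-empty 1-D inputs."""
    a = np.asarray(a)
    groups = np.asarray(groups)

    # A full sort guarantees that equal labels are contiguous.
    order = np.argsort(groups, kind="stable")
    sorted_groups = groups[order]

    # Index of the last element of each run of equal labels.
    is_run_end = np.r_[sorted_groups[1:] != sorted_groups[:-1], True]
    ends = np.nonzero(is_run_end)[0]

    # Differences of the running total at the run ends are the per-group sums.
    running_total = a[order].cumsum()
    return np.diff(np.r_[0, running_total[ends]])
```

The sums come back in ascending label order, the same order as np.unique(groups).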

(Slightly modified) example

x1 = np.array([1, 3, 2, 4, 8])  # I added a group for sake of generality
x2 = np.array([0, 1, 1, 0, 7])

>>> pd_group_sum(x1, x2)
0    5
1    5
7    8

>>> np_group_sum(x1, x2)  # and all the np_group_sum() variants
array([5, 5, 8])

Speed

n = 1_000_000
x1 = np.random.randint(0, 20, n)
x2 = np.random.randint(0, 20, n)

%timeit pd_group_sum(x1, x2)
# 13.9 ms ± 65.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np_group_sum(x1, x2)
# 171 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_group_sum2(x1, x2)
# 66.7 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_group_sum3(x1, x2)
# 25.6 ms ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Going via pandas is faster, in part because of numpy issue 11136.
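One more pure numpy option that is not part of the benchmark above is np.bincount with weights. The sketch below is an addition rather than one of the measured variants, the name np_group_sum_bincount is hypothetical, and no timing is claimed for it. Note that bincount with weights returns floats.

```python
import numpy as np

def np_group_sum_bincount(a, groups):
    """Grouped sum via np.bincount; returns (labels, sums as float64)."""
    a = np.asarray(a).ravel()
    groups = np.asarray(groups).ravel()

    # Map arbitrary labels to 0..n_labels-1 so bincount has no empty bins.
    labels, inverse = np.unique(groups, return_inverse=True)

    # bincount adds weights[i] into bin inverse[i].
    sums = np.bincount(inverse.ravel(), weights=a)
    return labels, sums
```

If the labels are already small non-negative integers, np.bincount(groups, weights=a) can be called directly, at the price of zero entries for labels that never occur.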

answered Oct 13 '22 05:10

Pierre D

>>> x1 = np.array([1, 3, 2, 7])
>>> x2 = np.array([0, 1, 1, 0])
>>> for index in np.unique(x2):
...     print(f'{index}: {x1[x2==index].sum()}')
...
0: 8
1: 5
>>> # or in one line
>>> [(index, x1[x2==index].sum()) for index in np.unique(x2)]
[(0, 8), (1, 5)]
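The same boolean-mask idea can be wrapped in a small function. This is only a sketch: the name group_sums is hypothetical, and it still loops in Python once per distinct label, so it suits cases with few labels.

```python
import numpy as np

def group_sums(values, labels):
    """Return {label: sum of the values carrying that label}."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    # labels == label is a boolean mask selecting one group at a time.
    return {
        label.item(): values[labels == label].sum().item()
        for label in np.unique(labels)
    }
```

With the arrays above, group_sums(x1, x2) gives {0: 8, 1: 5}.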

answered Oct 13 '22 04:10

Woodford

Would a pandas one-liner be ok?

store_0, store_1 = pd.DataFrame({"x1": x1, "x2": x2}).groupby("x2").x1.sum()

Or as a dictionary, for arbitrarily many values in x2:

pd.DataFrame({"x1": x1, "x2": x2}).groupby("x2").x1.sum().to_dict()

Output:

{0: 5, 1: 5}
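Note that the tuple unpacking in the first one-liner only works when x2 contains exactly two distinct values. A minimal self-contained sketch of the dictionary form, with the hypothetical name pd_group_sum_dict, handles any number of groups:

```python
import pandas as pd

def pd_group_sum_dict(x1, x2):
    """Sum x1 per distinct value of x2 and return a plain dict."""
    frame = pd.DataFrame({"x1": x1, "x2": x2})
    # Group rows by the label column, then sum the value column per group.
    return frame.groupby("x2")["x1"].sum().to_dict()
```

For example, pd_group_sum_dict([1, 3, 2, 4, 8], [0, 1, 1, 0, 7]) gives {0: 5, 1: 5, 7: 8}.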

answered Oct 13 '22 04:10

mcsoini
