Suppose i have:
x1 = [1, 3, 2, 4]
and:
x2 = [0, 1, 1, 0]
with the same shape
now i want to "put x2 ontop of x1" and sum up all the numbers of x1 corresponding to the numbers of x2
so the end result is:
end = [1+4 ,3+2] # end[0] is the sum of all numbers of x1 where a 0 was in x2
this is a naive implementation using list to further clarify the question
store_0 = 0
store_1 = 0
x1 = [1, 3, 4, 2]
x2 = [0, 1, 1, 0]
for value_x1 ,value_x2 in zip(x1 ,x2):
if value_x2 == 0:
store_0 += value_x1
elif value_x2 == 1:
store_1 += value_x1
so my question: is there is a way to implement this in numpy without using loops or in general just faster?
To create a list in python we need to use .append method to create a list of lists. After writing the above code (create a list of lists), Ones you will print “List of Lists:”, listoflists” then the output will appear as a “ Lists of Lists: “ [ [101,102, 103], [104, 105]] “. You can refer to the below screenshot for creating a list of lists.
You’ll know the custom templates not only from the naming but also the “generic” icon for custom lists. And here you go! After hitting “Create”, you’ll get your new list based on the template you selected – and if your template had contents included, you’ll even get them prepopulated.
We can also create an empty list by using the built-in function list () without any arguments. After writing the above code (python create an empty list), once you will print ” list ” then the output will appear as ” [] “. Here, we can see that the empty list has been created.
Python String Template: The Python string Template is created by passing the template string to its constructor. It supports $-based substitutions. This class has 2 key methods: substitute (mapping, **kwds): This method performs substitutions using a dictionary with a process similar to key-based mapping objects.
In this particular example (and, in general, for unique
, duplicated
, and groupby
kinds of operations), pandas
is faster than a pure numpy
solution:
A pandas
way, using Series
(credit: very similar to @mcsoini's answer):
def pd_group_sum(x1, x2):
return pd.Series(x1, index=x2).groupby(x2).sum()
A pure numpy
way, using np.unique
and some fancy indexing:
def np_group_sum(a, groups):
_, ix, rix = np.unique(groups, return_index=True, return_inverse=True)
return np.where(np.arange(len(ix))[:, None] == rix, a, 0).sum(axis=1)
Note: a better pure numpy
way is inspired by @Woodford's answer:
def selsum(a, g, e):
return a[g==e].sum()
vselsum = np.vectorize(selsum, signature='(n),(n),()->()')
def np_group_sum2(a, groups):
return vselsum(a, groups, np.unique(groups))
Yet another pure numpy
way is inspired by a comment from @mapf about using argsort()
. That in itself already takes 45ms, but we may try something based on np.argpartition(x2, len(x2)-1)
instead, since that takes only 7.5ms by itself on the benchmark below:
def np_group_sum3(a, groups):
ix = np.argpartition(groups, len(groups)-1)
ends = np.nonzero(np.diff(np.r_[groups[ix], groups.max() + 1]))[0]
return np.diff(np.r_[0, a[ix].cumsum()[ends]])
(Slightly modified) example
x1 = np.array([1, 3, 2, 4, 8]) # I added a group for sake of generality
x2 = np.array([0, 1, 1, 0, 7])
>>> pd_group_sum(x1, x2)
0 5
1 5
7 8
>>> np_group_sum(x1, x2) # and all the np_group_sum() variants
array([5, 5, 8])
Speed
n = 1_000_000
x1 = np.random.randint(0, 20, n)
x2 = np.random.randint(0, 20, n)
%timeit pd_group_sum(x1, x2)
# 13.9 ms ± 65.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np_group_sum(x1, x2)
# 171 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np_group_sum2(x1, x2)
# 66.7 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np_group_sum3(x1, x2)
# 25.6 ms ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Going via pandas is faster, in part because of numpy issue 11136.
>>> x1 = np.array([1, 3, 2, 7])
>>> x2 = np.array([0, 1, 1, 0])
>>> for index in np.unique(x2):
>>> print(f'{index}: {x1[x2==index].sum()}')
0: 8
1: 5
>>> # or in one line
>>> [(index, x1[x2==index].sum()) for index in np.unique(x2)]
[(0, 8), (1, 5)]
Would a pandas one-liner be ok?
store_0, store_1 = pd.DataFrame({"x1": x1, "x2": x2}).groupby("x2").x1.sum()
Or as a dictionary, for arbitrarily many values in x2
:
pd.DataFrame({"x1": x1, "x2": x2}).groupby("x2").x1.sum().to_dict()
Output:
{0: 5, 1: 5}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With