I'm trying to sum the float values of one vector according to the integer values of another vector.
For instance, if I have:
import numpy as np
a = np.array([0.1,0.2,0.3,0.4,0.5,0.6,7.3,0.8,0.9,1.,1.2,1.4])
b = np.array([0,0,0,0,0,1,1,1,2,2,2,2]).astype(int)
I would like to add the first 5 values of the a vector together (because the first 5 values of b are 0), the next 3 values together (because the next 3 values of b are 1), and so on. So at the end I would expect to have
c = function(a,b)
c = [0.1+0.2+0.3+0.4+0.5, 0.6+7.3+0.8, 0.9+1.+1.2+1.4]
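For reference, here is a plain-Python sketch of the behaviour I'm after (the function name and the loop are just for illustration; I'd like an efficient numpy way to do this):
import numpy as np
from itertools import groupby

a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 7.3, 0.8, 0.9, 1., 1.2, 1.4])
b = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2]).astype(int)

def group_sums(a, b):
    # sum consecutive runs of a that share the same value in b
    return [sum(x for _, x in grp) for _, grp in groupby(zip(b, a), key=lambda t: t[0])]

print(group_sums(a, b))  # approximately [1.5, 8.7, 4.5]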
Approach #1 : We can make use of np.bincount with b as the bins and a as the weights array -
In [203]: np.bincount(b,a)
Out[203]: array([1.5, 8.7, 4.5])
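In other words, np.bincount with a weights argument accumulates the sum of the weights that fall into each integer bin, which is exactly the grouped sum wanted here. A minimal self-contained sketch of the same call:
import numpy as np

a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 7.3, 0.8, 0.9, 1., 1.2, 1.4])
b = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

# bin i of the result accumulates a[j] for every j with b[j] == i
c = np.bincount(b, weights=a)
print(c)  # [1.5 8.7 4.5]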
Approach #2 : Another one leveraging matrix-multiplication -
In [210]: (b == np.arange(b.max()+1)[:,None]).dot(a)
Out[210]: array([1.5, 8.7, 4.5])
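To unpack that one-liner: it builds a boolean one-hot matrix with one row per label and lets the dot product do the per-row sums. A sketch of the same idea in separate steps:
import numpy as np

a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 7.3, 0.8, 0.9, 1., 1.2, 1.4])
b = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

labels = np.arange(b.max() + 1)   # [0 1 2]
mask = (b == labels[:, None])     # shape (3, 12), True where b equals each label
c = mask.dot(a)                   # each row of the mask picks out one group of a
print(c)                          # [1.5 8.7 4.5]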
For a pure numpy solution, you can check np.diff() of b, which will give you a new array of zeros everywhere except wherever the values change. However, this needs one small tweak, as np.diff() reduces the size of your array by one element, so your indices will be off by one. There is actually ongoing development in numpy to make this better (adding new arguments to pad the output back to the original size; see the issue here: https://github.com/numpy/numpy/issues/8132)
With that said...here's something that should be instructive:
In [100]: a
Out[100]: array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 7.3, 0.8, 0.9, 1. , 1.2, 1.4])
In [101]: b
Out[101]: array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
In [102]: np.diff(b) # note it is one element shorter than b
Out[102]: array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0])
In [103]: np.flatnonzero(np.diff(b))
Out[103]: array([4, 7])
In [104]: np.flatnonzero(np.diff(b)) + 1
Out[104]: array([5, 8])
In [105]: np.insert(np.flatnonzero(np.diff(b)) + 1, 0, 0)
Out[105]: array([0, 5, 8]) # these are the indices of the start of each group
In [106]: indices = _
In [107]: np.add.reduceat(a, indices)
Out[107]: array([1.5, 8.7, 4.5])
In [108]: def sumatchanges(a, b):
...: indices = np.insert(np.flatnonzero(np.diff(b)) + 1, 0, 0)
...: return np.add.reduceat(a, indices)
...:
In [109]: sumatchanges(a, b)
Out[109]: array([1.5, 8.7, 4.5])
In most settings I would definitely prefer using Pandas groupby as jpp's answer does, since this is ugly. Hopefully with those changes to numpy, it could look a bit nicer and more natural in the future.
Note that this answer is equivalent (in output) to the itertools.groupby answer that Maarten gave. Specifically, the groups are assumed to be sequential. I.e., this
b = np.array([0,0,0,0,0,1,1,1,2,2,2,2]).astype(int)
would produce the same output as
b = np.array([0,0,0,0,0,1,1,1,0,0,0,0]).astype(int)
The actual number is irrelevant, as long as it changes. However, the other solution Maarten gave and the pandas solution by jpp will sum everything with the same label, regardless of location. The question does not make clear which behaviour is preferred.
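A quick check of that claim, reusing a from the question and the sumatchanges function above (only the change points in b matter, not the labels themselves):
b1 = np.array([0,0,0,0,0,1,1,1,2,2,2,2])
b2 = np.array([0,0,0,0,0,1,1,1,0,0,0,0])

# both split a at positions 5 and 8, so the grouped sums are identical
print(sumatchanges(a, b1))  # [1.5 8.7 4.5]
print(sumatchanges(a, b2))  # [1.5 8.7 4.5]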
Here I'll create a random array for summing and a random non-decreasing array of labels, with 100k entries each, and time both functions:
In [115]: import timeit
In [116]: import pandas as pd
In [117]: def sumatchangespd(a, b):
...: return pd.Series(a).groupby(b).sum().values
...:
In [125]: l = 100_000
In [126]: a = np.random.rand(l)
In [127]: b = np.cumsum(np.random.randint(2, size=l))
In [128]: sumatchanges(a, b)
Out[128]:
array([2.83528234e-01, 6.66182064e-01, 9.32624292e-01, ...,
2.98379765e-01, 1.97586484e+00, 8.65103445e-04])
In [129]: %timeit sumatchanges(a, b)
1.91 ms ± 47.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [130]: %timeit sumatchangespd(a, b)
6.33 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Also just to make sure these are equivalent:
In [139]: all(np.isclose(sumatchanges(a, b), sumatchangespd(a, b)))
Out[139]: True
So the numpy version is faster (not too surprising). Again, these functions could do slightly different things though, depending on your input:
In [120]: b # numpy solution grabs each chunk as a separate piece
Out[120]: array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
In [121]: b[-4:] = 0
In [122]: b # pandas will sum the vals in a that have same vals in b
Out[122]: array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0])
In [123]: sumatchanges(a, b)
Out[123]: array([1.5, 8.7, 4.5])
In [124]: sumatchangespd(a, b)
Out[124]: array([6. , 8.7])
Divakar's main solution is brilliant and the best out of all of the above speed-wise:
In [144]: def sumatchangesbc(a, b):
...: return np.bincount(b,a)
...:
In [145]: %timeit sumatchangesbc(a, b)
175 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Order of magnitude faster than my numpy solution.
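For completeness, here is a small self-contained script (the function names are my own) that collects the three approaches and checks that they agree on consecutive groups:
import numpy as np
import pandas as pd

def sum_reduceat(a, b):
    # sum consecutive runs: split a wherever b changes value
    idx = np.insert(np.flatnonzero(np.diff(b)) + 1, 0, 0)
    return np.add.reduceat(a, idx)

def sum_bincount(a, b):
    # sum by label: bin i collects every a[j] with b[j] == i
    return np.bincount(b, weights=a)

def sum_pandas(a, b):
    # sum by label via pandas groupby
    return pd.Series(a).groupby(b).sum().values

a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 7.3, 0.8, 0.9, 1., 1.2, 1.4])
b = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

print(sum_reduceat(a, b))   # [1.5 8.7 4.5]
print(sum_bincount(a, b))   # [1.5 8.7 4.5]
print(sum_pandas(a, b))     # [1.5 8.7 4.5]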