Is it possible to combine (add) values of a vector according to integer value of another vector

I'm trying to add float values of a vector according to integer values from another vector.

for instance if I have:

import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 7.3, 0.8, 0.9, 1., 1.2, 1.4])
b = np.array([0,0,0,0,0,1,1,1,2,2,2,2]).astype(int)

I would like to add the first 5 values of the a vector together (because the first 5 values of b are 0), the next 3 values together (because the next 3 values of b are 1), and so on. At the end I would expect to have:

c = function(a,b)
c = [0.1+0.2+0.3+0.4+0.5,  0.6+7.3+0.8, 0.9+1.+1.2+1.4]
ymmx asked Oct 03 '18



2 Answers

Approach #1: We can make use of np.bincount with b as the bins and a as the weights array -

In [203]: np.bincount(b,a)
Out[203]: array([1.5, 8.7, 4.5])
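
Here np.bincount(b, a) treats b as bin indices and a as weights: each a[i] is accumulated into bin b[i], which is exactly a per-label sum. A loop-based equivalent, purely for intuition (the function name is mine, not part of the original answer):

import numpy as np

def bincount_by_hand(b, a):
    # accumulate each weight a[i] into the bin given by the label b[i]
    out = np.zeros(b.max() + 1)
    for label, weight in zip(b, a):
        out[label] += weight
    return out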

Approach #2: Another one leveraging matrix multiplication -

In [210]: (b == np.arange(b.max()+1)[:,None]).dot(a)
Out[210]: array([1.5, 8.7, 4.5])
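
To see why this works: b == np.arange(b.max()+1)[:,None] broadcasts to a boolean matrix with one row per label, each row True exactly where b equals that label, so the dot product with a sums the entries of a belonging to each label. A small illustration (the variable name mask is mine):

import numpy as np

a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 7.3, 0.8, 0.9, 1., 1.2, 1.4])
b = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

mask = (b == np.arange(b.max() + 1)[:, None])  # shape (3, 12)
print(mask.astype(int)[0])  # [1 1 1 1 1 0 0 0 0 0 0 0] -> picks out the first group
print(mask.dot(a))          # [1.5 8.7 4.5]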
Divakar answered Nov 29 '22


For a pure numpy solution, you can check np.diff() of b, which gives you a new array that is zero everywhere except where the values change. However, this needs one small tweak: np.diff() reduces the size of your array by one element, so your indices will be off by one. There is ongoing development in numpy to make this better (adding new arguments to pad the output back to the original size; see the issue here: https://github.com/numpy/numpy/issues/8132)
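
As an aside: that development has since landed. NumPy 1.16 added prepend/append arguments to np.diff, which remove the off-by-one bookkeeping entirely. A minimal sketch, assuming NumPy >= 1.16 (the function name is mine):

import numpy as np

def sumatchanges_prepend(a, b):
    # prepend a sentinel that differs from b[0], so index 0 always
    # registers as the start of a group; no manual insert/+1 needed
    indices = np.flatnonzero(np.diff(b, prepend=b[0] - 1))
    return np.add.reduceat(a, indices)

The rest of this answer uses the original API.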

With that said...here's something that should be instructive:

In [100]: a
Out[100]: array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 7.3, 0.8, 0.9, 1. , 1.2, 1.4])

In [101]: b
Out[101]: array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

In [102]: np.diff(b) # note it is one element shorter than b
Out[102]: array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

In [103]: np.flatnonzero(np.diff(b))
Out[103]: array([4, 7]) 

In [104]: np.flatnonzero(np.diff(b)) + 1
Out[104]: array([5, 8])

In [105]: np.insert(np.flatnonzero(np.diff(b)) + 1, 0, 0)
Out[105]: array([0, 5, 8]) # these are the indices of the start of each group

In [106]: indices = _

In [107]: np.add.reduceat(a, indices)
Out[107]: array([1.5, 8.7, 4.5])

In [108]: def sumatchanges(a, b):
     ...:     indices = np.insert(np.flatnonzero(np.diff(b)) + 1, 0, 0)
     ...:     return np.add.reduceat(a, indices)
     ...:

In [109]: sumatchanges(a, b)
Out[109]: array([1.5, 8.7, 4.5])

I would definitely prefer using Pandas groupby (as in jpp's answer) in most settings, as this is ugly. Hopefully with those changes to numpy, this could look a bit nicer and more natural in the future.


Note that this answer is equivalent (in output) to the itertools.groupby answer that Maarten gave. Specifically, the groups are assumed to be sequential. I.e., this

b = np.array([0,0,0,0,0,1,1,1,2,2,2,2]).astype(int)

would produce the same output as with

b = np.array([0,0,0,0,0,1,1,1,0,0,0,0]).astype(int)

The number is irrelevant, so long as it changes. However, the other solution Maarten gave, and the pandas solution by jpp, will sum everything with the same label, regardless of location. The OP does not make clear which behavior is preferred; see the sketch below for the sequential variant.
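
For reference, a minimal sketch of that sequential itertools.groupby approach (my reconstruction, not necessarily Maarten's exact code):

from itertools import groupby
import numpy as np

def sumatchanges_gb(a, b):
    # itertools.groupby only groups *adjacent* equal labels, so this
    # matches the run-based behaviour of sumatchanges above
    out, i = [], 0
    for _, grp in groupby(b):
        n = sum(1 for _ in grp)
        out.append(a[i:i + n].sum())
        i += n
    return np.array(out)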


Timing:

Here I'll create a random array for summing and a random array of increasing values, with 100k entries each, and time both functions:

In [115]: import timeit
In [116]: import pandas as pd

In [117]: def sumatchangespd(a, b):
     ...:     return pd.Series(a).groupby(b).sum().values
     ...:

In [125]: l = 100_000

In [126]: a = np.random.rand(l)

In [127]: b = np.cumsum(np.random.randint(2, size=l))

In [128]: sumatchanges(a, b)
Out[128]:
array([2.83528234e-01, 6.66182064e-01, 9.32624292e-01, ...,
       2.98379765e-01, 1.97586484e+00, 8.65103445e-04])

In [129]: %timeit sumatchanges(a, b)
1.91 ms ± 47.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [130]: %timeit sumatchangespd(a, b)
6.33 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Also just to make sure these are equivalent:

In [139]: all(np.isclose(sumatchanges(a, b), sumatchangespd(a, b)))
Out[139]: True

So the numpy version is faster (not too surprising). Again, these functions could do slightly different things though, depending on your input:

In [120]: b  # numpy solution grabs each chunk as a separate piece
Out[120]: array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2])

In [121]: b[-4:] = 0

In [122]: b   # pandas will sum the vals in a that have same vals in b
Out[122]: array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0])

In [123]: sumatchanges(a, b)
Out[123]: array([1.5, 8.7, 4.5])

In [124]: sumatchangespd(a, b)
Out[124]: array([6. , 8.7])

Divakar's main solution is brilliant and the best out of all of the above speed-wise:

In [144]: def sumatchangesbc(a, b):
     ...:     return np.bincount(b,a)
     ...:

In [145]: %timeit sumatchangesbc(a, b)
175 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

That's an order of magnitude faster than my numpy solution.

alkasm answered Nov 29 '22