Let's assume I have a large 2D numpy array, e.g. 1000x1000 elements. I also have two 1D integer arrays of length L, and a 1D float array of the same length. If I simply want to assign the floats to different positions in the original array according to the integer arrays, I could write:
mat = np.zeros((1000,1000))
int1 = np.random.randint(0,999,size=(50000,))
int2 = np.random.randint(0,999,size=(50000,))
f = np.random.rand(50000)
mat[int1,int2] = f
But if there are collisions, i.e. multiple floats mapping to a single location, all but the last one are overwritten. Is there a way to aggregate the collisions somehow, e.g. take the mean or median of all the floats falling at the same location? I would like to take advantage of vectorization and hopefully avoid interpreter loops.
Thanks!
Building on hpaulj's suggestion, here's how to get the mean value in case of collisions:
import numpy as np

mat = np.zeros((2, 2))
int1 = np.zeros(2, dtype=int)
int2 = np.zeros(2, dtype=int)
f = np.array([0.0, 1.0])

# unbuffered addition: repeated indices all contribute to the sum
np.add.at(mat, (int1, int2), f)

# count how many values landed on each location
n = np.zeros((2, 2))
np.add.at(n, (int1, int2), 1)

# divide only at the hit locations, so empty cells are never touched
mat[int1, int2] /= n[int1, int2]
print(mat)
[[0.5 0. ]
 [0.  0. ]]
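At the full size from the question, the same mean aggregation can also be done with np.bincount on flattened indices; a sketch, assuming the 1000x1000 shape (the variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
int1 = rng.integers(0, 1000, size=50000)
int2 = rng.integers(0, 1000, size=50000)
f = rng.random(50000)

# flatten the 2D coordinates into one linear index per value
idx = int1 * 1000 + int2

# per-cell sum of values and per-cell hit count, over all 10^6 cells
sums = np.bincount(idx, weights=f, minlength=1000 * 1000)
counts = np.bincount(idx, minlength=1000 * 1000)

# mean where at least one value landed, zero elsewhere
mat = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0).reshape(1000, 1000)
```

Like the np.add.at version this only handles the mean, not the median, but it avoids the unbuffered-ufunc call entirely.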
You can manipulate your data in pandas and then assign.
Starting from
mat = np.zeros((1000,1000))
a = np.random.randint(0,999,size=(50000,))
b = np.random.randint(0,999,size=(50000,))
c = np.random.rand(50000)
You can define a function
import pandas as pd

def get_aggregated_collisions(a, b, c):
    df = pd.DataFrame({'x': a, 'y': b, 'v': c})
    # build a hashable (x, y) coordinate to group by
    df['coord'] = df[['x', 'y']].apply(tuple, axis=1)
    d = df.groupby('coord').agg({'v': 'mean', 'x': 'first', 'y': 'first'}).to_dict('list')
    return d
and then
d = get_aggregated_collisions(a,b,c)
mat[d['x'], d['y']] = d['v']
The whole operation (including generating the matrices, np.random etc.) ran quite ok:
1.05 s ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The idea behind making a tuple of coordinates was to have a hashable key to group values by their coordinates. Maybe there is an even smarter way to do this :) always open to suggestions.
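One possible simplification, since pandas can group on several columns at once: skip the tuple column and group on the coordinate columns directly. A sketch of that variant (the function name is made up for illustration):

```python
import numpy as np
import pandas as pd

def get_aggregated_collisions_multi(a, b, c):
    # group directly on the two coordinate columns; no tuple column needed
    df = pd.DataFrame({'x': a, 'y': b, 'v': c})
    g = df.groupby(['x', 'y'], as_index=False)['v'].mean()
    return g['x'].to_numpy(), g['y'].to_numpy(), g['v'].to_numpy()

rng = np.random.default_rng(0)
a = rng.integers(0, 1000, size=50000)
b = rng.integers(0, 1000, size=50000)
c = rng.random(50000)

mat = np.zeros((1000, 1000))
x, y, v = get_aggregated_collisions_multi(a, b, c)
mat[x, y] = v
```

Changing 'mean' to 'median' in the groupby gives the median aggregation the question also asked about.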