Vectorized assignment in Numpy

Let's assume I have a large 2D numpy array, e.g. 1000x1000 elements. I also have two 1D integer arrays of length L, and a 1D float array of the same length. If I simply want to assign the floats to positions in the original array given by the integer arrays, I could write:

import numpy as np

mat = np.zeros((1000,1000))
# int1 holds row indices, int2 holds column indices
int1 = np.random.randint(0,999,size=(50000,))
int2 = np.random.randint(0,999,size=(50000,))
f = np.random.rand(50000)
mat[int1,int2] = f

But if there are collisions, i.e. multiple floats mapping to a single location, all but the last are overwritten. Is there a way to aggregate the collisions instead, e.g. take the mean or median of all the floats falling at the same location? I would like to take advantage of vectorization and avoid interpreter loops.

Thanks!

Cindy Almighty asked Jun 29 '18 00:06


2 Answers

Building on hpaulj's suggestion, here's how to get the mean value in case of collisions:

import numpy as np

mat = np.zeros((2,2))
int1 = np.zeros(2, dtype=int)
int2 = np.zeros(2, dtype=int)
f = np.array([0,1])

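# unbuffered add: repeated indices accumulate instead of overwriting
np.add.at(mat, (int1, int2), f)
# count how many values landed on each cell
n = np.zeros((2,2))
np.add.at(n, (int1, int2), 1)
# sum / count = mean, applied only at the cells that were hit
mat[int1, int2] /= n[int1, int2]
print(mat)

[[0.5 0. ]
 [0.  0. ]]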
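
For larger inputs, the same sum-and-count idea can also be written with np.bincount on flattened indices. The following is only a sketch under that assumption, not part of the approach above: the helper name scatter_mean_bincount is hypothetical, and cells that receive no value are simply left at 0.

```python
import numpy as np

def scatter_mean_bincount(shape, rows, cols, values):
    """Hypothetical helper: mean of all values landing on each (row, col) cell."""
    # Map each (row, col) pair to a single index into the flattened array
    flat = np.ravel_multi_index((rows, cols), shape)
    size = int(np.prod(shape))
    # Weighted bincount gives the per-cell sum, plain bincount the per-cell count
    total = np.bincount(flat, weights=values, minlength=size)
    count = np.bincount(flat, minlength=size)
    out = np.zeros(size)
    hit = count > 0  # divide only where at least one value landed
    out[hit] = total[hit] / count[hit]
    return out.reshape(shape)

rows = np.array([0, 0, 1, 1, 1])
cols = np.array([0, 0, 1, 1, 1])
vals = np.array([0.0, 1.0, 3.0, 4.0, 5.0])
result = scatter_mean_bincount((2, 2), rows, cols, vals)
print(result)
```

Here cell (0, 0) receives 0.0 and 1.0, and cell (1, 1) receives 3.0, 4.0 and 5.0, so each ends up holding the average of its own values. This only covers sums and means; a median needs a real group-by.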
Julien answered Sep 26 '22 04:09


You can manipulate your data in pandas and then assign.

Starting from

import numpy as np

mat = np.zeros((1000,1000))
a = np.random.randint(0,999,size=(50000,))
b = np.random.randint(0,999,size=(50000,))
c = np.random.rand(50000)

You can define a function

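import pandas as pd

def get_aggregated_collisions(a,b,c):
    df = pd.DataFrame({'x':a, 'y':b, 'v':c})
    # one hashable (x, y) tuple per row, used as the grouping key
    df['coord'] = df[['x','y']].apply(tuple, axis=1)
    # average v per coordinate; keep x and y for the final assignment
    d = df.groupby('coord').agg({"v":'mean','x':'first', 'y':'first'}).to_dict('list')
    return d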

and then

d = get_aggregated_collisions(a,b,c)
mat[d['x'], d['y']] = d['v']

The whole operation (including creating the matrix and the np.random calls) took about a second:

1.05 s ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The idea behind making a tuple of coordinates was to have a hashable key for grouping values by their coordinates. Maybe there is an even smarter way to do this :) I'm always open to suggestions.
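
One possibly simpler variant is to group on the two coordinate columns directly, which avoids building the tuple column. The sketch below assumes that approach; the function name aggregate_collisions and its how parameter are made up for illustration. Because the aggregation is passed by name, the same code can take a median as well as a mean.

```python
import numpy as np
import pandas as pd

def aggregate_collisions(rows, cols, values, how="mean"):
    """Hypothetical helper: one aggregated value per distinct (x, y) pair."""
    df = pd.DataFrame({"x": rows, "y": cols, "v": values})
    # Group on both coordinate columns at once; no tuple key needed
    return df.groupby(["x", "y"])["v"].agg(how).reset_index()

rows = np.array([0, 0, 0, 1])
cols = np.array([0, 0, 0, 1])
vals = np.array([1.0, 2.0, 10.0, 7.0])

mat = np.zeros((2, 2))
agg = aggregate_collisions(rows, cols, vals, how="median")
# Each (x, y) pair is unique after grouping, so plain assignment is safe
mat[agg["x"].to_numpy(), agg["y"].to_numpy()] = agg["v"].to_numpy()
print(mat)
```

In this small example cell (0, 0) receives 1.0, 2.0 and 10.0 and ends up with their median, while cell (1, 1) keeps its single value.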

rafaelc answered Sep 23 '22 04:09