Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to vectorize (make use of pandas/numpy) instead of using a nested for loop

I wish to efficiently use pandas (or numpy) instead of a nested for loop with an if statement to solve a particular problem. Here is a toy version:

Suppose I have the following two DataFrames

import pandas as pd
import numpy as np

dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)

dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)

Now I wish to loop through each row each dataframe and multiply the vals if a particular condition is met. This code works for what I want

ans = []

for i in range(len(df1)):
    for j in range(len(df2)):
        if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
            ans.append(df1['vals'][i]*df2['vals'][j])

np.sum(ans)

However, clearly this is very inefficient and in reality my DataFrames can have millions of entries making this unusable. I am also not making us of pandas or numpy efficient vector implementations. Does anyone have any ideas how to efficiently vectorize this nested loop?

I feel like this code is something akin to matrix multiplication so could progress be made utilising outer? It's the if condition that I'm finding hard to wedge in, as the if logic needs to compare each entry in df1 against all entries in df2.

like image 217
mch56 Avatar asked Dec 18 '25 23:12

mch56


1 Answers

You can also use a compiler like Numba to do this job. This would also outperform the vectorized solution and doesn't need a temporary array.

Example

import numba as nb
import numpy as np
import pandas as pd
import time

@nb.njit(fastmath=True,parallel=True,error_model='numpy')
def your_function(df1_in,df1_out,df1_vals,df2_in,df2_out,df2_vals):
  sum=0.
  for i in nb.prange(len(df1_in)):
      for j in range(len(df2_in)):
          if (df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]):
              sum+=df1_vals[i]*df2_vals[j]
  return sum

Testing

dict1 = {'vals': np.random.randint(1, 100, 1000),
         'in': np.random.randint(1, 10, 1000),
         'out': np.random.randint(1, 10, 1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1, 100, 1500),
         'in': 5*np.random.random(1500),
         'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)

# First call has some compilation overhead
res=your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                  df2['in'].values, df2['out'].values, df2['vals'].values)

t1 = time.time()
for i in range(1000):
  res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                      df2['in'].values, df2['out'].values, df2['vals'].values)

print(time.time() - t1)

Timings

vectorized solution @AGN Gazer: 9.15ms
parallelized Numba Version: 0.7ms
like image 160
max9111 Avatar answered Dec 20 '25 11:12

max9111