I wish to efficiently use pandas (or numpy) instead of a nested for loop with an if statement to solve a particular problem. Here is a toy version:
Suppose I have the following two DataFrames:
import pandas as pd
import numpy as np
dict1 = {'vals': [100,200], 'in': [0,1], 'out' :[1,3]}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': [500,800,300,200], 'in': [0.1,0.5,2,4], 'out' :[0.5,2,4,5]}
df2 = pd.DataFrame(data=dict2)
Now I wish to loop through each row of each DataFrame and multiply the vals whenever a particular condition is met. This code does what I want:
ans = []
for i in range(len(df1)):
    for j in range(len(df2)):
        if (df1['in'][i] <= df2['out'][j] and df1['out'][i] >= df2['in'][j]):
            ans.append(df1['vals'][i]*df2['vals'][j])
np.sum(ans)
However, this is clearly very inefficient, and in reality my DataFrames can have millions of entries, which makes it unusable. I am also not making use of the efficient vectorized implementations in pandas or numpy. Does anyone have any ideas on how to efficiently vectorize this nested loop?
I feel like this code is something akin to matrix multiplication so could progress be made utilising outer? It's the if condition that I'm finding hard to wedge in, as the if logic needs to compare each entry in df1 against all entries in df2.
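One way to wedge the condition in is to let broadcasting compare every row of df1 against every row of df2 at once, which gives a boolean mask with the same shape as the outer product of the two vals columns. The following is only a minimal sketch of that idea: the function name `overlap_weighted_sum` is made up for illustration, it is not necessarily the vectorized solution timed further down, and it builds temporary arrays of size len(df1) × len(df2), so it only suits inputs where that fits in memory.

```python
import numpy as np


def overlap_weighted_sum(in1, out1, vals1, in2, out2, vals2):
    """Sum vals1[i] * vals2[j] over all pairs (i, j) that pass the condition.

    Hypothetical helper: allocates len(in1) x len(in2) temporaries.
    """
    # Turn the df1 columns into column vectors so that each comparison
    # broadcasts against the df2 columns (treated as row vectors).
    in1 = np.asarray(in1)[:, None]
    out1 = np.asarray(out1)[:, None]

    # mask[i, j] is True exactly when the original `if` would be True.
    mask = (in1 <= np.asarray(out2)) & (out1 >= np.asarray(in2))

    # All pairwise products, then keep only the ones the mask allows.
    products = np.outer(vals1, vals2)
    return products[mask].sum()


# Toy data from the question, passed as plain sequences.
result = overlap_weighted_sum(
    [0, 1], [1, 3], [100, 200],
    [0.1, 0.5, 2, 4], [0.5, 2, 4, 5], [500, 800, 300, 200],
)
print(result)  # 350000
```

With DataFrames, the columns can be passed as `df1['in'].to_numpy()` and so on. The mask is the part that replaces the `if`; the outer product alone would sum every pair.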
You can also use a compiler like Numba for this job. In the timings below it outperforms the vectorized solution, and it doesn't need a temporary array.
Example
import numba as nb
import numpy as np
import pandas as pd
import time
@nb.njit(fastmath=True,parallel=True,error_model='numpy')
def your_function(df1_in,df1_out,df1_vals,df2_in,df2_out,df2_vals):
    sum=0.
    for i in nb.prange(len(df1_in)):
        for j in range(len(df2_in)):
            if (df1_in[i] <= df2_out[j] and df1_out[i] >= df2_in[j]):
                sum+=df1_vals[i]*df2_vals[j]
    return sum
Testing
dict1 = {'vals': np.random.randint(1, 100, 1000),
         'in': np.random.randint(1, 10, 1000),
         'out': np.random.randint(1, 10, 1000)}
df1 = pd.DataFrame(data=dict1)
dict2 = {'vals': np.random.randint(1, 100, 1500),
         'in': 5*np.random.random(1500),
         'out': 5*np.random.random(1500)}
df2 = pd.DataFrame(data=dict2)
# First call has some compilation overhead
res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                    df2['in'].values, df2['out'].values, df2['vals'].values)
t1 = time.time()
for i in range(1000):
    res = your_function(df1['in'].values, df1['out'].values, df1['vals'].values,
                        df2['in'].values, df2['out'].values, df2['vals'].values)
# Total seconds for 1000 calls, which is numerically the milliseconds per call
print(time.time() - t1)
Timings
Vectorized solution (@AGN Gazer): 9.15 ms
Parallelized Numba version: 0.7 ms
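Both timings are for approaches that still test every (i, j) pair, so the cost grows with len(df1) × len(df2). If every row satisfies in <= out (an assumption the question does not state, and one the random test data above does not meet), the same sum could be computed with sorting and prefix sums instead. The sketch below is one possible way to do that, not part of the original answer; `overlap_weighted_sum_sorted` is a hypothetical name.

```python
import numpy as np


def overlap_weighted_sum_sorted(in1, out1, vals1, in2, out2, vals2):
    """Hypothetical helper; only valid if in <= out for every row of both inputs."""
    in1, out1, vals1 = map(np.asarray, (in1, out1, vals1))
    in2, out2, vals2 = map(np.asarray, (in2, out2, vals2))

    # Prefix sums of vals2, once ordered by 'out' and once ordered by 'in'.
    by_out = np.argsort(out2)
    by_in = np.argsort(in2)
    cum_by_out = np.concatenate(([0], np.cumsum(vals2[by_out])))
    cum_by_in = np.concatenate(([0], np.cumsum(vals2[by_in])))

    # For each df1 row: total vals2 of rows that start no later than out1 ...
    started = cum_by_in[np.searchsorted(in2[by_in], out1, side='right')]
    # ... minus total vals2 of rows that already ended before in1.
    # Under the in <= out assumption, those rows are a subset of `started`.
    ended = cum_by_out[np.searchsorted(out2[by_out], in1, side='left')]

    return np.dot(vals1, started - ended)


# Toy data from the question (every row there has in <= out).
result = overlap_weighted_sum_sorted(
    [0, 1], [1, 3], [100, 200],
    [0.1, 0.5, 2, 4], [0.5, 2, 4, 5], [500, 800, 300, 200],
)
print(result)  # 350000
```

This does two sorts and two binary searches per row rather than a full pairwise comparison. On data where some row has in > out, it can give a different answer from the nested loop, so it is worth checking the inputs first.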