
How to optimally apply a function on all items of a dataframe using inputs from another dataframe?

I am new to Python and I am currently struggling to do simple things with pandas. I would like to apply the same function to each item of a given dataset, but using a time-dependent parameter.

I am working with a pandas DataFrame with timestamps as the index.

Let's say :

a(i, j) is the ith element in column j of a dataframe A (timestamp/index = i and column = j)

b(i) is the ith element in a dataframe B (with a single column)

I want to compute:

c(i, j) = fct(a(i, j), b(i))

where fct is a function with two arguments z = fct(x, y)

I wrote code that does it correctly, but it is likely not optimal (it is very slow). For the example I just used a simple function fct (in reality it is more complex).

Inputs:

  • df_data: pandas.DataFrame with index=timestamps and several columns
  • df_parameter: pandas.DataFrame with 1 column containing the time-dependent parameter

Here is the code:

# p.concat is required as timestamps are not identical in df_data & df_parameter
import numpy as np
import pandas as p

# Inner join keeps only the timestamps present in both dataframes
temp = p.concat([df_data, df_parameter], join='inner', axis=1)
index = temp.index
np_data = temp[df_data.columns].values
np_parameter = temp[df_parameter.columns].values

import math 

def fct(x, y):
    return math.pow(x, y)

def test(np_data, np_parameter):
    np_result = np.empty(np_data.shape, dtype=float)
    it = np.nditer(np_data, flags=['multi_index'])

    while not it.finished:
        np_result[it.multi_index] = fct(it[0].item(),
                                        np_parameter[it.multi_index[0]][0])
        it.iternext()

    df_final=p.DataFrame(data=np_result, index=index)
    return df_final

final=test(np_data, np_parameter)   

final.to_csv(r'C:\temp\test.csv', sep=';')

Here is some example data:

df_data

01/03/2010 00:00  ;  9  ;  5  ;  7  
01/03/2010 00:10  ;  9  ;  1  ;  4  
01/03/2010 00:20  ;  5  ;  3  ;  8  
01/03/2010 00:30  ;  7  ;  7  ;  1  
01/03/2010 00:40  ;  8  ;  2  ;  3  
01/03/2010 00:50  ;  0  ;  3  ;  4     
01/03/2010 01:00  ;  4  ;  3  ;  2  
01/03/2010 01:10  ;  6  ;  2  ;  2  
01/03/2010 01:20  ;  6  ;  8  ;  5  
01/03/2010 01:30  ;  7  ;  7  ;  0  

df_parameter

01/03/2010 00:00  ;  2  
01/03/2010 00:10  ;  5  
01/03/2010 00:20  ;  2  
01/03/2010 00:30  ;  3  
01/03/2010 00:40  ;  0  
01/03/2010 00:50  ;  2  
01/03/2010 01:00  ;  4  
01/03/2010 01:10  ;  3  
01/03/2010 01:20  ;  3  
01/03/2010 01:30  ;  1  

final

01/03/2010 00:00  ;  81  ;  25  ;  49  
01/03/2010 00:10  ;  59049  ;  1  ;  1024  
01/03/2010 00:20  ;  25  ;  9  ;  64  
01/03/2010 00:30  ;  343  ;  343  ;  1  
01/03/2010 00:40  ;  1  ;  1  ;  1  
01/03/2010 00:50  ;  0  ;  9  ;  16  
01/03/2010 01:00  ;  256  ;  81  ;  16  
01/03/2010 01:10  ;  216  ;  8  ;  8  
01/03/2010 01:20  ;  216  ;  512  ;  125  
01/03/2010 01:30  ;  7  ;  7  ;  0  

Thank you very very much in advance for your help,

Patrick

sweetdream asked Dec 04 '25 16:12



1 Answer

I don't know if this is the optimal way, but it is simpler and should be more efficient, since it uses vectorized operations for the calculations:

import pandas as pd

def func(x, y):
    # Works element-wise on whole Series, aligned on the index
    return x ** y

data = pd.read_csv('data.dat', sep=';', index_col=0, parse_dates=True,
                   header=None, names=list('abc'))
para = pd.read_csv('parameter.dat', sep=';', index_col=0, parse_dates=True,
                   header=None, names=['para'])

# One vectorized call per column instead of one call per element
for col in data:
    data['%s_result' % col] = func(data[col], para.para)

print(data)

results in

                     a  b  c  a_result  b_result  c_result
2010-01-03 00:00:00  9  5  7        81        25        49
2010-01-03 00:10:00  9  1  4     59049         1      1024
2010-01-03 00:20:00  5  3  8        25         9        64
2010-01-03 00:30:00  7  7  1       343       343         1
2010-01-03 00:40:00  8  2  3         1         1         1
2010-01-03 00:50:00  0  3  4         0         9        16
2010-01-03 01:00:00  4  3  2       256        81        16
2010-01-03 01:10:00  6  2  2       216         8         8
2010-01-03 01:20:00  6  8  5       216       512       125
2010-01-03 01:30:00  7  7  0         7         7         0
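
For a simple power function, the per-column loop may not be needed at all. The sketch below is one possible variant, not part of the original answer: it rebuilds the first three rows of the sample data by hand (the extra 00:30 timestamp in the parameter is an assumption added to mimic non-identical indexes) and uses DataFrame.pow with axis=0, so that each row is raised to the parameter with the same timestamp.

```python
import pandas as pd

# First three rows of the sample data; column names a, b, c follow the answer.
idx = pd.to_datetime(["2010-01-03 00:00", "2010-01-03 00:10", "2010-01-03 00:20"])
data = pd.DataFrame({"a": [9, 9, 5], "b": [5, 1, 3], "c": [7, 4, 8]}, index=idx)

# Assumption for illustration: the parameter has one timestamp that data lacks.
para_idx = pd.to_datetime(["2010-01-03 00:00", "2010-01-03 00:10",
                           "2010-01-03 00:20", "2010-01-03 00:30"])
para = pd.Series([2, 5, 2, 3], index=para_idx, name="para")

# Keep only timestamps present in both objects (same idea as an inner join).
common = data.index.intersection(para.index)

# axis=0 matches the Series against the row index, so row i of every column
# is raised to para[i] in a single call.
result = data.loc[common].pow(para.loc[common], axis=0)
print(result)
```

The result has the same index and column names as data, so it can be joined back or written to CSV directly.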

If your real function is more complex, you should still try to vectorize it; failing that, numpy.vectorize() is the next best solution.
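
As a rough illustration of both options, here is a minimal sketch. The function fct_scalar is a hypothetical stand-in for a "more complex" function (it is not the asker's real function), and the small arrays are made up. Note that the NumPy documentation describes numpy.vectorize as a convenience wrapper that still loops in Python, so a version written with array operations will usually be faster.

```python
import numpy as np

# Hypothetical scalar function with a branch, standing in for the real one.
def fct_scalar(x, y):
    if y == 0:
        return 0.0
    return x ** y / y

data = np.array([[9, 5, 7], [9, 1, 4], [8, 2, 3]], dtype=float)
para = np.array([2, 5, 0], dtype=float)

# Reshape para to a column so that row i of data is paired with para[i].
para_col = para[:, np.newaxis]

# Option 1: np.vectorize handles the broadcasting but calls fct_scalar
# once per element.
via_vectorize = np.vectorize(fct_scalar, otypes=[float])(data, para_col)

# Option 2: express the branch with array operations instead.
safe_para = np.where(para_col == 0, 1.0, para_col)  # avoids dividing by zero
native = np.where(para_col == 0, 0.0, data ** safe_para / safe_para)

print(np.allclose(via_vectorize, native))
```

Both options return an array with the same shape as data; wrapping it in a DataFrame with the original index and columns restores the timestamps.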

bmu answered Dec 06 '25 05:12
