How to optimally apply a function on all items of a dataframe using inputs from another dataframe?

Question

I am new to Python and am currently struggling to do simple things with pandas. I would like to apply the same function to each item of a given dataset, but using a time-dependent parameter.

I am working with pandas DataFrames that have timestamps as the index.

Let's say :

a(i, j) is the ith element of column j in a DataFrame A (timestamp/index = i, column = j)

b(i) is the ith element of a DataFrame B (with a single column)

I want to compute:

c(i, j) = fct(a(i, j), b(i))

where fct is a function with two arguments z = fct(x, y)

I wrote code that does this correctly, but it is likely not optimal (it is very slow). For the example I use a simple function fct, but in reality it is more complex.

Inputs:

df_data: pandas.DataFrame with index=timestamps and several columns
df_parameter: pandas.DataFrame with 1 column containing the time-dependent parameter

Here is the code:

import math

import numpy as np
import pandas as p

# p.concat with join='inner' is required because the timestamps are not
# identical in df_data and df_parameter
temp = p.concat([df_data, df_parameter], join='inner', axis=1)
index = temp.index
np_data = temp[df_data.columns].values
np_parameter = temp[df_parameter.columns].values

def fct(x, y):
    return math.pow(x, y)

def test(np_data, np_parameter):
    np_result = np.empty(np_data.shape, dtype=float)
    it = np.nditer(np_data, flags=['multi_index'])

    while not it.finished:
        np_result[it.multi_index] = fct(it[0].item(),
                                        np_parameter[it.multi_index[0]][0])
        it.iternext()

    df_final=p.DataFrame(data=np_result, index=index)
    return df_final

final=test(np_data, np_parameter)   

final.to_csv(r'C:\temp\test.csv', sep=';')

Here is some example data:

df_data

01/03/2010 00:00  ;  9  ;  5  ;  7  
01/03/2010 00:10  ;  9  ;  1  ;  4  
01/03/2010 00:20  ;  5  ;  3  ;  8  
01/03/2010 00:30  ;  7  ;  7  ;  1  
01/03/2010 00:40  ;  8  ;  2  ;  3  
01/03/2010 00:50  ;  0  ;  3  ;  4     
01/03/2010 01:00  ;  4  ;  3  ;  2  
01/03/2010 01:10  ;  6  ;  2  ;  2  
01/03/2010 01:20  ;  6  ;  8  ;  5  
01/03/2010 01:30  ;  7  ;  7  ;  0

df_parameter

01/03/2010 00:00  ;  2  
01/03/2010 00:10  ;  5  
01/03/2010 00:20  ;  2  
01/03/2010 00:30  ;  3  
01/03/2010 00:40  ;  0  
01/03/2010 00:50  ;  2  
01/03/2010 01:00  ;  4  
01/03/2010 01:10  ;  3  
01/03/2010 01:20  ;  3  
01/03/2010 01:30  ;  1

final

01/03/2010 00:00  ;  81  ;  25  ;  49  
01/03/2010 00:10  ;  59049  ;  1  ;  1024  
01/03/2010 00:20  ;  25  ;  9  ;  64  
01/03/2010 00:30  ;  343  ;  343  ;  1  
01/03/2010 00:40  ;  1  ;  1  ;  1  
01/03/2010 00:50  ;  0  ;  9  ;  16  
01/03/2010 01:00  ;  256  ;  81  ;  16  
01/03/2010 01:10  ;  216  ;  8  ;  8  
01/03/2010 01:20  ;  216  ;  512  ;  125  
01/03/2010 01:30  ;  7  ;  7  ;  0

Thank you very very much in advance for your help,

Patrick

bmu · Accepted Answer

Don't know if this is the optimal way, but this is simpler and should be more efficient as it uses vectorized functions for the calculations:

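import pandas as pd

def func(x, y):
    return x ** y

# first column of each file is the timestamp, used as the index
data = pd.read_csv('data.dat', sep=';', index_col=0, parse_dates=True,
                   header=None, names=['a', 'b', 'c'])
para = pd.read_csv('parameter.dat', sep=';', index_col=0, parse_dates=True,
                   header=None, names=['para'])

# apply func to each data column; the Series are aligned on the index
for col in list(data.columns):
    data['%s_result' % col] = func(data[col], para.para)

print(data)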

results in

                     a  b  c  a_result  b_result  c_result
2010-01-03 00:00:00  9  5  7        81        25        49
2010-01-03 00:10:00  9  1  4     59049         1      1024
2010-01-03 00:20:00  5  3  8        25         9        64
2010-01-03 00:30:00  7  7  1       343       343         1
2010-01-03 00:40:00  8  2  3         1         1         1
2010-01-03 00:50:00  0  3  4         0         9        16
2010-01-03 01:00:00  4  3  2       256        81        16
2010-01-03 01:10:00  6  2  2       216         8         8
2010-01-03 01:20:00  6  8  5       216       512       125
2010-01-03 01:30:00  7  7  0         7         7         0
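
The per-column loop above can also be written as one broadcast operation over the whole frame. The following is only a sketch under assumptions: the two small frames are made-up sample data with partly overlapping timestamps (the question mentions that the timestamps are not identical), the index intersection stands in for the inner join used in the question, and func is the simple power example rather than the real function.

```python
import pandas as pd

# Hypothetical sample inputs; the timestamps only partly overlap.
idx_data = pd.date_range("2010-01-03 00:00", periods=4, freq="10min")
idx_para = pd.date_range("2010-01-03 00:10", periods=4, freq="10min")
data = pd.DataFrame({"a": [9, 9, 5, 7], "b": [5, 1, 3, 7], "c": [7, 4, 8, 1]},
                    index=idx_data)
para = pd.DataFrame({"para": [5, 2, 3, 0]}, index=idx_para)


def func(x, y):
    return x ** y


# Keep only the timestamps that occur in both frames (same effect as an inner join).
common = data.index.intersection(para.index)
data_common = data.loc[common]
para_common = para.loc[common, "para"]

# Turn the parameter into an (n, 1) column so it broadcasts across all data columns.
values = func(data_common.to_numpy(), para_common.to_numpy()[:, None])
result = pd.DataFrame(values, index=common, columns=data.columns)
```

Here result keeps the original column names and contains one row per shared timestamp. This only works if func accepts NumPy arrays, which is the case for plain arithmetic like x ** y.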

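If your real function is more complex, you should still try to vectorize it (that is, write it in terms of operations on whole arrays), or use numpy.vectorize() as the next best solution. Note that numpy.vectorize() is mainly a convenience: it still calls the Python function once per element.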
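
To illustrate the two options, here is a minimal sketch. The branching function fct_scalar is purely hypothetical (the real function is not shown in the question); it is only there to show how a scalar function with an if can be rewritten with array operations, and how numpy.vectorize can wrap it unchanged.

```python
import numpy as np


def fct_scalar(x, y):
    # Hypothetical scalar function with a branch; stands in for the real one.
    if y > 2:
        return x ** y
    return x * y


def fct_array(x, y):
    # Option 1: express the branch with array operations.
    # np.where evaluates both branches for every element and then selects.
    return np.where(y > 2, x ** y, x * y)


# Option 2: wrap the scalar function. Convenient, but it still loops in Python.
fct_wrapped = np.vectorize(fct_scalar)

x = np.array([[9, 5, 7], [9, 1, 4], [5, 3, 8]])
y = np.array([2, 5, 2])[:, None]  # one parameter per row, shaped (n, 1)

out_array = fct_array(x, y)
out_wrapped = fct_wrapped(x, y)
```

Both calls return the same 3x3 array. The array version is usually the faster of the two on large inputs, but it requires that both branches can be evaluated safely for every element.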

Tags: python, pandas, numpy