Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Accelerate a pandas operation involving several dataframes

Hello everyone

For a school project, I am stuck with the duration of an operation with Pandas Dataframe.

I have one dataframe df which shape is (250 000 000, 200). This dataframe contains values of variable describing the behaviours of sensors on a machine. They are organized by 'Cycle' (everytime the machine begins a new cycle, this variable is incremented by one). And in this cycle, 'CycleTime' describes the position of the row within the 'Cycle'.

In the 'mean' DataFrame, I compute the mean of each variables group by the 'CycleTime'

The 'anomaly_matrix' DataFrame represents the global anomaly of each cycle which is the sum of the square difference of each rows belonging to the Cycle with the mean of corresponding CycleTime.

An example of my code is below

df = pd.DataFrame({'Cycle': [0, 0, 0, 1, 1, 1, 2, 2], 'CycleTime': [0, 1, 2, 0, 1, 2, 0, 1], 'variable1': [0, 0.5, 0.25, 0.3, 0.4, 0.1, 0.2, 0.25], 'variable2':[1, 2, 1, 1, 2, 2, 1, 2], 'variable3': [100, 5000, 200, 900, 100, 2000, 300, 300]})
mean = df.drop(['Cycle'], axis = 1).groupby("CycleTime").agg('mean')
anomali_matrix = df.drop(['CycleTime'], axis = 1).groupby("Cycle").agg('mean')
anomaly_matrix = anomali_matrix - anomali_matrix

for index, row in df.iterrows():
    cycle = row["Cycle"]
    time = row["CycleTime"]
    anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2


>>>anomaly_matrix
   variable1    variable2   variable3
Cycle           
0   0.047014    0.25       1.116111e+07
1   0.023681    0.25       3.917778e+06
2   0.018889    0.00       2.267778e+06

This calculation for my (250 000 000, 200) DataFrame last 6 hours, it is due to anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2

I tried to improve by using an apply function but I do not succeed in adding other DataFrame in that apply function. Same thing trying to vectorize pandas.

Do you have any idea how to accelerate this process ? Thanks

like image 452
NgDriver Avatar asked Jan 25 '26 23:01

NgDriver


1 Answers

You can use:

df1 = df.set_index(['Cycle', 'CycleTime'])

mean = df1.sub(df1.groupby('CycleTime').transform('mean'))**2
df2 = mean.groupby('Cycle').sum()
print (df2)
       variable1  variable2     variable3
Cycle                                    
0       0.047014       0.25  1.116111e+07
1       0.023681       0.25  3.917778e+06
2       0.018889       0.00  2.267778e+06
like image 127
jezrael Avatar answered Jan 27 '26 11:01

jezrael



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!