Accelerate a pandas operation involving several dataframes

Question

Hello everyone

For a school project, I am stuck with the duration of an operation with Pandas Dataframe.

I have one dataframe df which shape is (250 000 000, 200). This dataframe contains values of variable describing the behaviours of sensors on a machine. They are organized by 'Cycle' (everytime the machine begins a new cycle, this variable is incremented by one). And in this cycle, 'CycleTime' describes the position of the row within the 'Cycle'.

In the 'mean' DataFrame, I compute the mean of each variables group by the 'CycleTime'

The 'anomaly_matrix' DataFrame represents the global anomaly of each cycle which is the sum of the square difference of each rows belonging to the Cycle with the mean of corresponding CycleTime.

An example of my code is below

df = pd.DataFrame({'Cycle': [0, 0, 0, 1, 1, 1, 2, 2], 'CycleTime': [0, 1, 2, 0, 1, 2, 0, 1], 'variable1': [0, 0.5, 0.25, 0.3, 0.4, 0.1, 0.2, 0.25], 'variable2':[1, 2, 1, 1, 2, 2, 1, 2], 'variable3': [100, 5000, 200, 900, 100, 2000, 300, 300]})
mean = df.drop(['Cycle'], axis = 1).groupby("CycleTime").agg('mean')
anomali_matrix = df.drop(['CycleTime'], axis = 1).groupby("Cycle").agg('mean')
anomaly_matrix = anomali_matrix - anomali_matrix

for index, row in df.iterrows():
    cycle = row["Cycle"]
    time = row["CycleTime"]
    anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2


>>>anomaly_matrix
   variable1    variable2   variable3
Cycle           
0   0.047014    0.25       1.116111e+07
1   0.023681    0.25       3.917778e+06
2   0.018889    0.00       2.267778e+06

This calculation for my (250 000 000, 200) DataFrame last 6 hours, it is due to anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2

I tried to improve by using an apply function but I do not succeed in adding other DataFrame in that apply function. Same thing trying to vectorize pandas.

Do you have any idea how to accelerate this process ? Thanks

jezrael · Accepted Answer

You can use:

df1 = df.set_index(['Cycle', 'CycleTime'])

mean = df1.sub(df1.groupby('CycleTime').transform('mean'))**2
df2 = mean.groupby('Cycle').sum()
print (df2)
       variable1  variable2     variable3
Cycle                                    
0       0.047014       0.25  1.116111e+07
1       0.023681       0.25  3.917778e+06
2       0.018889       0.00  2.267778e+06

Accelerate a pandas operation involving several dataframes

Tags:

python

pandas

dataframe

accelerate

NgDriver

1 Answers

jezrael

Recent Activity

Donate For Us

Accelerate a pandas operation involving several dataframes

Tags:

python

pandas

dataframe

accelerate

NgDriver

1 Answers

jezrael

Related questions

Recent Activity

Donate For Us