Calculating activity interval for a pandas DataFrame

Given the following DataFrame of pandas in Python:

date                  ID_bulb  switch   using_time          error
2022-02-05 14:30:21+00:00   5   OFF       NaT               INIT
2022-02-27 15:30:21+00:00   5   ON        NaT               ERROR2
2022-02-27 17:05:21+00:00   5   OFF   0 days 01:35:00       OK
2022-04-07 15:30:21+00:00   5   ON        NaT               OK
2022-04-07 15:30:21+00:00   5   OFF   0 days 00:00:00       OK
2022-04-07 17:05:21+00:00   5   OFF       NaT               ERROR2
2022-04-06 15:30:21+00:00   4   ON        NaT               INIT
2022-04-06 15:35:21+00:00   4   OFF       NaT               ERROR1
2022-04-06 16:10:21+00:00   4   ON        NaT               ERROR2
2022-04-07 15:30:21+00:00   4   OFF   0 days 23:20:00       OK
2022-04-07 17:05:21+00:00   4   ON        NaT               ERROR2
2022-01-01 19:40:21+00:00   3   ON        NaT               INIT
2022-02-03 22:40:21+00:00   3   ON        NaT               ERROR2
2022-02-03 23:20:21+00:00   3   OFF   0 days 00:40:00       OK
2022-02-04 00:20:21+00:00   3   ON        NaT               OK
2022-02-04 14:30:21+00:00   3   ON        NaT               ERROR2
2022-02-04 15:30:21+00:00   3   ON        NaT               ERROR2
2022-02-04 15:35:21+00:00   3   OFF   0 days 00:05:00       OK
2022-02-04 15:40:21+00:00   3   OFF       NaT               ERROR2
2022-02-04 19:40:21+00:00   3   ON        NaT               OK
2022-02-06 15:35:21+00:00   3   OFF   1 days 19:55:00       OK
2022-02-28 18:40:21+00:00   3   ON        NaT               ERROR1
2022-10-12 18:40:21+00:00   3   OFF   226 days 00:00:00     OK
2022-02-04 09:10:21+00:00   2   ON        NaT               OK
2022-02-04 14:10:21+00:00   2   ON        NaT               ERROR2

Problem to solve: I want to add a new column called cost_days. For rows where using_time is not NaT, this column should contain the number of times the bulb has been on for at least n hours in a row within a night period defined by start_time and end_time; all other rows should get 0.

Proposed solution, based on @keramat's idea.

import numpy as np
import pandas as pd
from datetime import datetime

def rounder(x):
    # Fixed parameters: at least n = 5 hours within the 22:00-07:00 window
    n = 5
    start_date = "22:00"
    end_date = "07:00"
    # assert (n+1) < time_slot
    time_1 = datetime.strptime(start_date, "%H:%M")
    time_2 = datetime.strptime(end_date, "%H:%M")
    time_slot = (time_2 - time_1).seconds // 3600
    v = pd.date_range(list(x)[-2], list(x)[-1], freq='1h')
    temp = pd.Series(v, index=v).between_time(start_date, end_date)
    temp = len(temp) / time_slot
    return np.floor(temp) if np.mod(temp, 1.0) < (n + 1) / time_slot else np.ceil(temp) / time_slot

g = df['using_time'].notna().sort_index(ascending=False).cumsum()
g = (g - max(g)).abs()

temp = df.groupby(g)['date'].apply(lambda x: rounder(x))

# Up to this point, it runs perfectly.

df.loc[df[df['using_time'].notna()].index, 'cost_days'] = temp.values
# ValueError: shape mismatch: value array of shape (8,) could not be
# broadcast to indexing result of shape (7,)

df['cost_days'] = df['cost_days'].fillna(0)

print(df)

I need to fix the error to get the result I want.

The resulting dataframe should look as follows:

date                  ID_bulb  switch   using_time          error    cost_days
2022-02-05 14:30:21+00:00   5   OFF       NaT               INIT         0
2022-02-27 15:30:21+00:00   5   ON        NaT               ERROR2       0
2022-02-27 17:05:21+00:00   5   OFF   0 days 01:35:00       OK           0
2022-04-07 15:30:21+00:00   5   ON        NaT               OK           0
2022-04-07 15:30:21+00:00   5   OFF   0 days 00:00:00       OK           0
2022-04-07 17:05:21+00:00   5   OFF       NaT               ERROR2       0
2022-04-06 15:30:21+00:00   4   ON        NaT               INIT         0
2022-04-06 15:35:21+00:00   4   OFF       NaT               ERROR1       0
2022-04-06 16:10:21+00:00   4   ON        NaT               ERROR2       0
2022-04-07 15:30:21+00:00   4   OFF   0 days 23:20:00       OK           1
2022-04-07 17:05:21+00:00   4   ON        NaT               ERROR2       0
2022-01-01 19:40:21+00:00   3   ON        NaT               INIT         0
2022-02-03 22:40:21+00:00   3   ON        NaT               ERROR2       0
2022-02-03 23:20:21+00:00   3   OFF   0 days 00:40:00       OK           0
2022-02-04 00:20:21+00:00   3   ON        NaT               OK           0
2022-02-04 14:30:21+00:00   3   ON        NaT               ERROR2       0
2022-02-04 15:30:21+00:00   3   ON        NaT               ERROR2       0
2022-02-04 15:35:21+00:00   3   OFF   0 days 00:05:00       OK           0
2022-02-04 15:40:21+00:00   3   OFF       NaT               ERROR2       0
2022-02-04 19:40:21+00:00   3   ON        NaT               OK           0
2022-02-06 15:35:21+00:00   3   OFF   1 days 19:55:00       OK           2
2022-02-28 18:40:21+00:00   3   ON        NaT               ERROR1       0
2022-10-12 18:40:21+00:00   3   OFF   226 days 00:00:00     OK           226
2022-02-04 09:10:21+00:00   2   ON        NaT               OK           0
2022-02-04 14:10:21+00:00   2   ON        NaT               ERROR2       0

Edit: I think the problem is that the input dataset must end with a non-null using_time value for the above code to work. How could I fix this so that I get the behaviour I want?

asked Dec 27 '25 by Carola

2 Answers

Use:

if df['using_time'].iloc[-1]=='NaT':
    g = g[g!=g.max()]

Please note that because I built the df from your string, using_time is of type string, hence the comparison with the literal 'NaT'. In your case, with real NaT values, you probably need:

if pd.isna(df['using_time'].iloc[-1]):
    g = g[g!=g.max()]

Demonstration:

string = """date   ID_bulb   switch   using_time   error
2022-02-05 14:30:21+00:00   5   OFF   NaT   INIT
2022-02-27 15:30:21+00:00   5   ON   NaT   ERROR2
2022-02-27 17:05:21+00:00   5   OFF   0 days 01:35:00   OK
2022-04-07 15:30:21+00:00   5   ON   NaT   OK
2022-04-07 15:30:21+00:00   5   OFF   0 days 00:00:00   OK
2022-04-07 17:05:21+00:00   5   OFF   NaT   ERROR2
2022-04-06 15:30:21+00:00   4   ON   NaT   INIT
2022-04-06 15:35:21+00:00   4   OFF   NaT   ERROR1
2022-04-06 16:10:21+00:00   4   ON   NaT   ERROR2
2022-04-07 15:30:21+00:00   4   OFF   0 days 23:20:00   OK
2022-04-07 17:05:21+00:00   4   ON   NaT   ERROR2
2022-01-01 19:40:21+00:00   3   ON   NaT   INIT
2022-02-03 22:40:21+00:00   3   ON   NaT   ERROR2
2022-02-03 23:20:21+00:00   3   OFF   0 days 00:40:00   OK
2022-02-04 00:20:21+00:00   3   ON   NaT   OK
2022-02-04 14:30:21+00:00   3   ON   NaT   ERROR2
2022-02-04 15:30:21+00:00   3   ON   NaT   ERROR2
2022-02-04 15:35:21+00:00   3   OFF   0 days 00:05:00   OK
2022-02-04 15:40:21+00:00   3   OFF   NaT   ERROR2
2022-02-04 19:40:21+00:00   3   ON   NaT   OK
2022-02-06 15:35:21+00:00   3   OFF   1 days 19:55:00   OK
2022-02-28 18:40:21+00:00   3   ON   NaT   ERROR1
2022-10-12 18:40:21+00:00   3   OFF   226 days 00:00:00   OK
2022-02-04 09:10:21+00:00   2   ON   NaT   OK
2022-02-04 14:10:21+00:00   2   ON   NaT   ERROR2"""

import pandas as pd
import numpy as np
from datetime import datetime

data = [x.split('   ') for x in string.split('\n')]
df = pd.DataFrame(data[1:], columns=data[0])

def rounder(x):
    # Fixed parameters: at least n = 5 hours within the 22:00-07:00 window
    n = 5
    start_date = "22:00"
    end_date = "07:00"
    # assert (n+1) < time_slot
    time_1 = datetime.strptime(start_date, "%H:%M")
    time_2 = datetime.strptime(end_date, "%H:%M")
    time_slot = (time_2 - time_1).seconds // 3600
    v = pd.date_range(list(x)[-2], list(x)[-1], freq='1h')
    temp = pd.Series(v, index=v).between_time(start_date, end_date)
    temp = len(temp) / time_slot
    return np.floor(temp) if np.mod(temp, 1.0) < (n + 1) / time_slot else np.ceil(temp) / time_slot

g = (df['using_time']!='NaT').sort_index(ascending=False).cumsum()
g = (g-max(g)).abs()
if df['using_time'].iloc[-1]=='NaT':
    g = g[g!=g.max()]
temp = df.groupby(g)['date'].apply(lambda x: rounder(x))  

df.loc[df[df['using_time']!='NaT'].index, 'cost_days'] = temp.values

df['cost_days'] = df['cost_days'].fillna(0)

df

Output:

date    ID_bulb switch  using_time  error   cost_days
0   2022-02-05 14:30:21+00:00   5   OFF NaT INIT    0.0
1   2022-02-27 15:30:21+00:00   5   ON  NaT ERROR2  0.0
2   2022-02-27 17:05:21+00:00   5   OFF 0 days 01:35:00 OK  0.0
3   2022-04-07 15:30:21+00:00   5   ON  NaT OK  0.0
4   2022-04-07 15:30:21+00:00   5   OFF 0 days 00:00:00 OK  0.0
5   2022-04-07 17:05:21+00:00   5   OFF NaT ERROR2  0.0
6   2022-04-06 15:30:21+00:00   4   ON  NaT INIT    0.0
7   2022-04-06 15:35:21+00:00   4   OFF NaT ERROR1  0.0
8   2022-04-06 16:10:21+00:00   4   ON  NaT ERROR2  0.0
9   2022-04-07 15:30:21+00:00   4   OFF 0 days 23:20:00 OK  1.0
10  2022-04-07 17:05:21+00:00   4   ON  NaT ERROR2  0.0
11  2022-01-01 19:40:21+00:00   3   ON  NaT INIT    0.0
12  2022-02-03 22:40:21+00:00   3   ON  NaT ERROR2  0.0
13  2022-02-03 23:20:21+00:00   3   OFF 0 days 00:40:00 OK  0.0
14  2022-02-04 00:20:21+00:00   3   ON  NaT OK  0.0
15  2022-02-04 14:30:21+00:00   3   ON  NaT ERROR2  0.0
16  2022-02-04 15:30:21+00:00   3   ON  NaT ERROR2  0.0
17  2022-02-04 15:35:21+00:00   3   OFF 0 days 00:05:00 OK  0.0
18  2022-02-04 15:40:21+00:00   3   OFF NaT ERROR2  0.0
19  2022-02-04 19:40:21+00:00   3   ON  NaT OK  0.0
20  2022-02-06 15:35:21+00:00   3   OFF 1 days 19:55:00 OK  2.0
21  2022-02-28 18:40:21+00:00   3   ON  NaT ERROR1  0.0
22  2022-10-12 18:40:21+00:00   3   OFF 226 days 00:00:00   OK  226.0
23  2022-02-04 09:10:21+00:00   2   ON  NaT OK  0.0
24  2022-02-04 14:10:21+00:00   2   ON  NaT ERROR2  0.0
answered Dec 30 '25 by keramat


Maybe you can use transform instead of apply to keep the same shape as the input. After that, use where to keep the computed values only on rows where using_time is not null:

# temp = df.groupby(g)['date'].apply(lambda x: rounder(x))  
df['cost_day'] = (df.groupby(g)['date'].transform(lambda x: rounder(x))
                    .where(df['using_time'].notna(), other=0).astype(int))

Output:

>>> df
                        date  ID_bulb switch        using_time   error  cost_day
0  2022-02-05 14:30:21+00:00        5    OFF               NaT    INIT         0
1  2022-02-27 15:30:21+00:00        5     ON               NaT  ERROR2         0
2  2022-02-27 17:05:21+00:00        5    OFF   0 days 01:35:00      OK         0
3  2022-04-07 15:30:21+00:00        5     ON               NaT      OK         0
4  2022-04-07 15:30:21+00:00        5    OFF   0 days 00:00:00      OK         0
5  2022-04-07 17:05:21+00:00        5    OFF               NaT  ERROR2         0
6  2022-04-06 15:30:21+00:00        4     ON               NaT    INIT         0
7  2022-04-06 15:35:21+00:00        4    OFF               NaT  ERROR1         0
8  2022-04-06 16:10:21+00:00        4     ON               NaT  ERROR2         0
9  2022-04-07 15:30:21+00:00        4    OFF   0 days 23:20:00      OK         1
10 2022-04-07 17:05:21+00:00        4     ON               NaT  ERROR2         0
11 2022-01-01 19:40:21+00:00        3     ON               NaT    INIT         0
12 2022-02-03 22:40:21+00:00        3     ON               NaT  ERROR2         0
13 2022-02-03 23:20:21+00:00        3    OFF   0 days 00:40:00      OK         0
14 2022-02-04 00:20:21+00:00        3     ON               NaT      OK         0
15 2022-02-04 14:30:21+00:00        3     ON               NaT  ERROR2         0
16 2022-02-04 15:30:21+00:00        3     ON               NaT  ERROR2         0
17 2022-02-04 15:35:21+00:00        3    OFF   0 days 00:05:00      OK         0
18 2022-02-04 15:40:21+00:00        3    OFF               NaT  ERROR2         0
19 2022-02-04 19:40:21+00:00        3     ON               NaT      OK         0
20 2022-02-06 15:35:21+00:00        3    OFF   1 days 19:55:00      OK         2
21 2022-02-28 18:40:21+00:00        3     ON               NaT  ERROR1         0
22 2022-10-12 18:40:21+00:00        3    OFF 226 days 00:00:00      OK       226
23 2022-02-04 09:10:21+00:00        2     ON               NaT      OK         0
24 2022-02-04 14:10:21+00:00        2     ON               NaT  ERROR2         0
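Why transform sidesteps the shape mismatch can be seen on a toy Series (my own example, unrelated to the bulb data): apply reduces each group to a single value, while transform broadcasts that value back onto the original index.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=list("abcd"))
g = [0, 0, 1, 1]

reduced = s.groupby(g).apply(lambda x: x.max())      # one row per group
aligned = s.groupby(g).transform(lambda x: x.max())  # same shape/index as s

print(reduced.tolist())  # [2, 4]
print(aligned.tolist())  # [2, 2, 4, 4]
```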
answered Dec 30 '25 by Corralien


