Given the following pandas DataFrame in Python:
date ID_bulb switch using_time error
2022-02-05 14:30:21+00:00 5 OFF NaT INIT
2022-02-27 15:30:21+00:00 5 ON NaT ERROR2
2022-02-27 17:05:21+00:00 5 OFF 0 days 01:35:00 OK
2022-04-07 15:30:21+00:00 5 ON NaT OK
2022-04-07 15:30:21+00:00 5 OFF 0 days 00:00:00 OK
2022-04-07 17:05:21+00:00 5 OFF NaT ERROR2
2022-04-06 15:30:21+00:00 4 ON NaT INIT
2022-04-06 15:35:21+00:00 4 OFF NaT ERROR1
2022-04-06 16:10:21+00:00 4 ON NaT ERROR2
2022-04-07 15:30:21+00:00 4 OFF 0 days 23:20:00 OK
2022-04-07 17:05:21+00:00 4 ON NaT ERROR2
2022-01-01 19:40:21+00:00 3 ON NaT INIT
2022-02-03 22:40:21+00:00 3 ON NaT ERROR2
2022-02-03 23:20:21+00:00 3 OFF 0 days 00:40:00 OK
2022-02-04 00:20:21+00:00 3 ON NaT OK
2022-02-04 14:30:21+00:00 3 ON NaT ERROR2
2022-02-04 15:30:21+00:00 3 ON NaT ERROR2
2022-02-04 15:35:21+00:00 3 OFF 0 days 00:05:00 OK
2022-02-04 15:40:21+00:00 3 OFF NaT ERROR2
2022-02-04 19:40:21+00:00 3 ON NaT OK
2022-02-06 15:35:21+00:00 3 OFF 1 days 19:55:00 OK
2022-02-28 18:40:21+00:00 3 ON NaT ERROR1
2022-10-12 18:40:21+00:00 3 OFF 226 days 00:00:00 OK
2022-02-04 09:10:21+00:00 2 ON NaT OK
2022-02-04 14:10:21+00:00 2 ON NaT ERROR2
Problem to solve: I want to add a new column called cost_days. For rows where using_time is not NaT, this column should contain the number of times the light bulb has been on for at least n hours in a row within a night period defined by start_time and end_time; for all other rows it should be 0.
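For instance, the bulb-3 row with using_time of 1 days 19:55:00 spans two full nights, so its cost_days should be 2. A minimal sketch of that night-hour counting for a single ON/OFF pair, assuming the hourly grid and the 5-hour / 22:00-07:00 parameters used in the proposed solution below:

import pandas as pd

# Hedged sketch: night coverage for one ON/OFF pair (bulb 3, using_time = 1 days 19:55:00)
on_ts = pd.Timestamp("2022-02-04 19:40:21+00:00")
off_ts = pd.Timestamp("2022-02-06 15:35:21+00:00")

v = pd.date_range(on_ts, off_ts, freq='1h')                    # hourly grid over the ON interval
night = pd.Series(v, index=v).between_time("22:00", "07:00")   # keep only 22:00-07:00 ticks
time_slot = 9                                                  # hours from 22:00 to 07:00
print(len(night) / time_slot)                                  # 2.0 -> matches cost_days = 2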
Proposed solution, based on @keramat's idea:

from datetime import datetime
import numpy as np
import pandas as pd

def rounder(x):
    # Fixed parameters: at least n = 5 hours of use within the 22:00 to 07:00 window
    n = 5
    start_date = "22:00"
    end_date = "07:00"
    # assert (n+1) < time_slot
    time_1 = datetime.strptime(start_date, "%H:%M")
    time_2 = datetime.strptime(end_date, "%H:%M")
    time_slot = (time_2 - time_1).seconds // 3600
    # Hourly grid between the last two dates of the group (the ON and OFF timestamps)
    v = pd.date_range(list(x)[-2], list(x)[-1], freq='1h')
    temp = pd.Series(v, index=v).between_time(start_date, end_date)
    temp = len(temp) / time_slot
    return np.floor(temp) if np.mod(temp, 1.0) < (n+1)/time_slot else np.ceil(temp)/time_slot

# Group each run of NaT rows together with the non-NaT row that closes it
g = (df['using_time'].notna()).sort_index(ascending=False).cumsum()
g = (g - max(g)).abs()
temp = df.groupby(g)['date'].apply(lambda x: rounder(x))
# Up to this point, it runs perfectly.
df.loc[df[df['using_time'].notna()].index, 'cost_days'] = temp.values
# ValueError: shape mismatch: value array of shape (8,) could not be broadcast to indexing result of shape (7,)
df['cost_days'] = df['cost_days'].fillna(0)
print(df)
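For reference, a minimal sketch (toy boolean mask, hypothetical values) of what the reversed cumulative sum does: each run of NaT rows lands in the same group as the non-NaT row that follows it.

import pandas as pd

mask = pd.Series([False, False, True, False, True])   # True where using_time is not NaT
g = mask.sort_index(ascending=False).cumsum()
g = (g - max(g)).abs()
print(g.sort_index().tolist())                         # [0, 0, 0, 1, 1]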
I need to fix the error to get the result I want.
The resulting dataframe should look as follows:
date ID_bulb switch using_time error cost_days
2022-02-05 14:30:21+00:00 5 OFF NaT INIT 0
2022-02-27 15:30:21+00:00 5 ON NaT ERROR2 0
2022-02-27 17:05:21+00:00 5 OFF 0 days 01:35:00 OK 0
2022-04-07 15:30:21+00:00 5 ON NaT OK 0
2022-04-07 15:30:21+00:00 5 OFF 0 days 00:00:00 OK 0
2022-04-07 17:05:21+00:00 5 OFF NaT ERROR2 0
2022-04-06 15:30:21+00:00 4 ON NaT INIT 0
2022-04-06 15:35:21+00:00 4 OFF NaT ERROR1 0
2022-04-06 16:10:21+00:00 4 ON NaT ERROR2 0
2022-04-07 15:30:21+00:00 4 OFF 0 days 23:20:00 OK 1
2022-04-07 17:05:21+00:00 4 ON NaT ERROR2 0
2022-01-01 19:40:21+00:00 3 ON NaT INIT 0
2022-02-03 22:40:21+00:00 3 ON NaT ERROR2 0
2022-02-03 23:20:21+00:00 3 OFF 0 days 00:40:00 OK 0
2022-02-04 00:20:21+00:00 3 ON NaT OK 0
2022-02-04 14:30:21+00:00 3 ON NaT ERROR2 0
2022-02-04 15:30:21+00:00 3 ON NaT ERROR2 0
2022-02-04 15:35:21+00:00 3 OFF 0 days 00:05:00 OK 0
2022-02-04 15:40:21+00:00 3 OFF NaT ERROR2 0
2022-02-04 19:40:21+00:00 3 ON NaT OK 0
2022-02-06 15:35:21+00:00 3 OFF 1 days 19:55:00 OK 2
2022-02-28 18:40:21+00:00 3 ON NaT ERROR1 0
2022-10-12 18:40:21+00:00 3 OFF 226 days 00:00:00 OK 226
2022-02-04 09:10:21+00:00 2 ON NaT OK 0
2022-02-04 14:10:21+00:00 2 ON NaT ERROR2 0
Edit: I think the problem is that the input dataset must end with a non-null using_time value for the above code to work. How can I fix this so that I get the result I want?
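To see why a trailing NaT breaks the assignment, here is a hedged sketch on a toy mask (hypothetical values) where the last rows have using_time of NaT: the groupby then yields one more group than there are non-NaT rows, which is exactly the shape mismatch above.

import pandas as pd

mask = pd.Series([True, False, False])                 # last rows are NaT
g = mask.sort_index(ascending=False).cumsum()
g = (g - max(g)).abs()
print(g.sort_index().tolist())                         # [0, 1, 1] -> 2 groups, but only 1 non-NaT row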
Use:
if df['using_time'].iloc[-1] == 'NaT':
    g = g[g != g.max()]

Please note that, because I built the df from your string, using_time is of type string here. In your case you probably need:

if pd.isna(df['using_time'].iloc[-1]):
    g = g[g != g.max()]
Demonstration:
string = """date ID_bulb switch using_time error
2022-02-05 14:30:21+00:00 5 OFF NaT INIT
2022-02-27 15:30:21+00:00 5 ON NaT ERROR2
2022-02-27 17:05:21+00:00 5 OFF 0 days 01:35:00 OK
2022-04-07 15:30:21+00:00 5 ON NaT OK
2022-04-07 15:30:21+00:00 5 OFF 0 days 00:00:00 OK
2022-04-07 17:05:21+00:00 5 OFF NaT ERROR2
2022-04-06 15:30:21+00:00 4 ON NaT INIT
2022-04-06 15:35:21+00:00 4 OFF NaT ERROR1
2022-04-06 16:10:21+00:00 4 ON NaT ERROR2
2022-04-07 15:30:21+00:00 4 OFF 0 days 23:20:00 OK
2022-04-07 17:05:21+00:00 4 ON NaT ERROR2
2022-01-01 19:40:21+00:00 3 ON NaT INIT
2022-02-03 22:40:21+00:00 3 ON NaT ERROR2
2022-02-03 23:20:21+00:00 3 OFF 0 days 00:40:00 OK
2022-02-04 00:20:21+00:00 3 ON NaT OK
2022-02-04 14:30:21+00:00 3 ON NaT ERROR2
2022-02-04 15:30:21+00:00 3 ON NaT ERROR2
2022-02-04 15:35:21+00:00 3 OFF 0 days 00:05:00 OK
2022-02-04 15:40:21+00:00 3 OFF NaT ERROR2
2022-02-04 19:40:21+00:00 3 ON NaT OK
2022-02-06 15:35:21+00:00 3 OFF 1 days 19:55:00 OK
2022-02-28 18:40:21+00:00 3 ON NaT ERROR1
2022-10-12 18:40:21+00:00 3 OFF 226 days 00:00:00 OK
2022-02-04 09:10:21+00:00 2 ON NaT OK
2022-02-04 14:10:21+00:00 2 ON NaT ERROR2"""
import pandas as pd
import numpy as np
from datetime import datetime

data = [x.split(' ') for x in string.split('\n')]
df = pd.DataFrame(data[1:], columns=data[0])

def rounder(x):
    # Fixed parameters: at least n = 5 hours of use within the 22:00 to 07:00 window
    n = 5
    start_date = "22:00"
    end_date = "07:00"
    # assert (n+1) < time_slot
    time_1 = datetime.strptime(start_date, "%H:%M")
    time_2 = datetime.strptime(end_date, "%H:%M")
    time_slot = (time_2 - time_1).seconds // 3600
    v = pd.date_range(list(x)[-2], list(x)[-1], freq='1h')
    temp = pd.Series(v, index=v).between_time(start_date, end_date)
    temp = len(temp) / time_slot
    return np.floor(temp) if np.mod(temp, 1.0) < (n+1)/time_slot else np.ceil(temp)/time_slot

g = (df['using_time'] != 'NaT').sort_index(ascending=False).cumsum()
g = (g - max(g)).abs()
# Drop the trailing group if the last row has no using_time value
if df['using_time'].iloc[-1] == 'NaT':
    g = g[g != g.max()]
temp = df.groupby(g)['date'].apply(lambda x: rounder(x))
df.loc[df[df['using_time'] != 'NaT'].index, 'cost_days'] = temp.values
df['cost_days'] = df['cost_days'].fillna(0)
df
Output:
date ID_bulb switch using_time error cost_days
0 2022-02-05 14:30:21+00:00 5 OFF NaT INIT 0.0
1 2022-02-27 15:30:21+00:00 5 ON NaT ERROR2 0.0
2 2022-02-27 17:05:21+00:00 5 OFF 0 days 01:35:00 OK 0.0
3 2022-04-07 15:30:21+00:00 5 ON NaT OK 0.0
4 2022-04-07 15:30:21+00:00 5 OFF 0 days 00:00:00 OK 0.0
5 2022-04-07 17:05:21+00:00 5 OFF NaT ERROR2 0.0
6 2022-04-06 15:30:21+00:00 4 ON NaT INIT 0.0
7 2022-04-06 15:35:21+00:00 4 OFF NaT ERROR1 0.0
8 2022-04-06 16:10:21+00:00 4 ON NaT ERROR2 0.0
9 2022-04-07 15:30:21+00:00 4 OFF 0 days 23:20:00 OK 1.0
10 2022-04-07 17:05:21+00:00 4 ON NaT ERROR2 0.0
11 2022-01-01 19:40:21+00:00 3 ON NaT INIT 0.0
12 2022-02-03 22:40:21+00:00 3 ON NaT ERROR2 0.0
13 2022-02-03 23:20:21+00:00 3 OFF 0 days 00:40:00 OK 0.0
14 2022-02-04 00:20:21+00:00 3 ON NaT OK 0.0
15 2022-02-04 14:30:21+00:00 3 ON NaT ERROR2 0.0
16 2022-02-04 15:30:21+00:00 3 ON NaT ERROR2 0.0
17 2022-02-04 15:35:21+00:00 3 OFF 0 days 00:05:00 OK 0.0
18 2022-02-04 15:40:21+00:00 3 OFF NaT ERROR2 0.0
19 2022-02-04 19:40:21+00:00 3 ON NaT OK 0.0
20 2022-02-06 15:35:21+00:00 3 OFF 1 days 19:55:00 OK 2.0
21 2022-02-28 18:40:21+00:00 3 ON NaT ERROR1 0.0
22 2022-10-12 18:40:21+00:00 3 OFF 226 days 00:00:00 OK 226.0
23 2022-02-04 09:10:21+00:00 2 ON NaT OK 0.0
24 2022-02-04 14:10:21+00:00 2 ON NaT ERROR2 0.0
Maybe you can use transform instead of apply to keep the same shape as the input. After that, use where to keep the computed values only for rows where using_time is not null (and 0 elsewhere):
# temp = df.groupby(g)['date'].apply(lambda x: rounder(x))
df['cost_day'] = (df.groupby(g)['date'].transform(lambda x: rounder(x))
                    .where(df['using_time'].notna(), other=0).astype(int))
Output:
>>> df
date ID_bulb switch using_time error cost_day
0 2022-02-05 14:30:21+00:00 5 OFF NaT INIT 0
1 2022-02-27 15:30:21+00:00 5 ON NaT ERROR2 0
2 2022-02-27 17:05:21+00:00 5 OFF 0 days 01:35:00 OK 0
3 2022-04-07 15:30:21+00:00 5 ON NaT OK 0
4 2022-04-07 15:30:21+00:00 5 OFF 0 days 00:00:00 OK 0
5 2022-04-07 17:05:21+00:00 5 OFF NaT ERROR2 0
6 2022-04-06 15:30:21+00:00 4 ON NaT INIT 0
7 2022-04-06 15:35:21+00:00 4 OFF NaT ERROR1 0
8 2022-04-06 16:10:21+00:00 4 ON NaT ERROR2 0
9 2022-04-07 15:30:21+00:00 4 OFF 0 days 23:20:00 OK 1
10 2022-04-07 17:05:21+00:00 4 ON NaT ERROR2 0
11 2022-01-01 19:40:21+00:00 3 ON NaT INIT 0
12 2022-02-03 22:40:21+00:00 3 ON NaT ERROR2 0
13 2022-02-03 23:20:21+00:00 3 OFF 0 days 00:40:00 OK 0
14 2022-02-04 00:20:21+00:00 3 ON NaT OK 0
15 2022-02-04 14:30:21+00:00 3 ON NaT ERROR2 0
16 2022-02-04 15:30:21+00:00 3 ON NaT ERROR2 0
17 2022-02-04 15:35:21+00:00 3 OFF 0 days 00:05:00 OK 0
18 2022-02-04 15:40:21+00:00 3 OFF NaT ERROR2 0
19 2022-02-04 19:40:21+00:00 3 ON NaT OK 0
20 2022-02-06 15:35:21+00:00 3 OFF 1 days 19:55:00 OK 2
21 2022-02-28 18:40:21+00:00 3 ON NaT ERROR1 0
22 2022-10-12 18:40:21+00:00 3 OFF 226 days 00:00:00 OK 226
23 2022-02-04 09:10:21+00:00 2 ON NaT OK 0
24 2022-02-04 14:10:21+00:00 2 ON NaT ERROR2 0
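As a side note, a minimal sketch (toy data, hypothetical values) of why transform avoids the shape mismatch: apply returns one value per group, while transform broadcasts that value back to every row of the group.

import pandas as pd

toy = pd.DataFrame({'grp': [0, 0, 1], 'val': [1, 2, 3]})
print(toy.groupby('grp')['val'].apply(lambda s: s.sum()).tolist())      # [3, 3]    -> one value per group
print(toy.groupby('grp')['val'].transform(lambda s: s.sum()).tolist())  # [3, 3, 3] -> one value per row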