Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

unstacking shift data (start and end time) into hourly data

Tags:

python

pandas

I have a df as follows which shows when a person started a shift, ended a shift, the amount of hours and the date worked.

Business_Date   Number PayTimeStart PayTimeEnd          Hours
0   2019-05-24  1       2019-05-24 11:00:00 2019-05-24 12:15:00 1.250
1   2019-05-24  2       2019-05-24 12:30:00 2019-05-24 13:30:00 1.00

Now what I'm trying to do is break this into an hourly format, so I know how many hours were used between 11:00 - 12:00

so, in my head, for the above, I want to put the 1 hour between 11 - 12 into the bin for 11:00 and the remainder 0.25 into the next bin of 12

so I would end up with something like

    Business Date   Time Hour
0   2019-05-24  11:00 1
1   2019-05-24  12:00 0.75
2   2019-05-24  13:00 0.5
like image 516
Umar.H Avatar asked May 28 '19 11:05

Umar.H


3 Answers

One idea is working with minutes - first use list comprehension with flattening for Series and then grouping by hours with hours for count by GroupBy.size and last divide by 60 for final hours:

s = pd.Series([z for x, y in zip(df['Pay Time Start'], 
                                 df['Pay Time End'] - pd.Timedelta(60, unit='s')) 
                 for z in pd.date_range(x, y, freq='Min')])

df = (s.groupby([s.dt.date.rename('Business Date'), s.dt.hour.rename('Time')])
       .size()
       .div(60)
       .reset_index(name='Hour'))
print (df)
  Business Date  Time  Hour
0    2019-05-24    11  1.00
1    2019-05-24    12  0.75
2    2019-05-24    13  0.50

If you need to group by a location or ID

 df1 = pd.DataFrame([(z, w) for x, y, w in zip(df['Pay Time Start'], 
                                              df['Pay Time End'] - pd.Timedelta(60, unit='s'), 
                                              df['Location']) for z in pd.date_range(x, y, freq='Min')], 
                   columns=['Date','Location']) 

 df = (df1.groupby([df1['Date'].dt.date.rename('Business Date'), 
                       df1['Date'].dt.hour.rename('Time'), df1['Location']]) 
          .size() .div(60) .reset_index(name='Hour'))
like image 140
jezrael Avatar answered Oct 23 '22 08:10

jezrael


Another idea, similar's to @jezrael's but work with seconds for higher precision:

def get_series(a):
    s, e, h = a
    idx = pd.date_range(s,e, freq='6s')
    return pd.Series(h/len(idx), index=idx)

(pd.concat(map(get_series, zip(df.Pay_Time_Start,
                          df.Pay_Time_End, 
                          df.Hours)))
   .resample('H').sum()
)

Output:

2019-05-24 11:00:00    0.998668
2019-05-24 12:00:00    0.750500
2019-05-24 13:00:00    0.500832
Freq: H, dtype: float64
like image 2
Quang Hoang Avatar answered Oct 23 '22 07:10

Quang Hoang


Another idea just for your convenience (plus I like challenging questions) is using melt and then conditionally calculating the minutes:

Basically, you have two formulas for your calculations (Pseudocode):

  • Minutes in Pay Time Start: 60 - minutes in df['Pay Time Start]
  • Minutes in Pay Time End: minutes in df['Pay Time End]

So we can use these formulas to create our new data:

First we melt our Times in one column

new = df.melt(id_vars=['Business Date', 'Number'], 
              value_vars=['Pay Time Start', 'Pay Time End'],
              var_name='Pay Time Name',
              value_name='Pay Time Date').sort_values('Number')

# Apply the formulas noted above
new['Minutes'] = np.where(new['Pay Time Name'].eq('Pay Time Start'), 
                          60 - new['Pay Time Date'].dt.minute, 
                          new['Pay Time Date'].dt.minute)

# Out
  Business Date  Number   Pay Time Name       Pay Time Date  Minutes
0    2019-05-24       1  Pay Time Start 2019-05-24 11:00:00       60
2    2019-05-24       1    Pay Time End 2019-05-24 12:15:00       15
1    2019-05-24       2  Pay Time Start 2019-05-24 12:30:00       30
3    2019-05-24       2    Pay Time End 2019-05-24 13:30:00       30

Now we calculate the amount of hours with groupby:

daterange = pd.date_range(df['Pay Time Start'].min(), df['Pay Time End'].max(), freq='H')

df_new = pd.DataFrame({'Date':daterange.date,
                    'Time':daterange.time}, dtype='datetime64[ns]')

df_new['Hours'] = (new.groupby(new['Pay Time Date'].dt.hour)['Minutes'].sum()/60).to_numpy()

Final Output

        Date      Time  Hours
0 2019-05-24  11:00:00   1.00
1 2019-05-24  12:00:00   0.75
2 2019-05-24  13:00:00   0.50
like image 1
Erfan Avatar answered Oct 23 '22 07:10

Erfan