I have a pandas dataframe that represents a shift schedule for an entire year, given as:
January 2019 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Shift A 1 1 1 0 0 0 2 2 0 0 1 1 1 1 0 2 2 0 0 0 0 0 0 0 2 2 2 0 1 1 1
Shift B 0 2 2 0 0 0 0 0 0 0 2 2 2 0 1 1 1 0 0 0 2 2 0 0 1 1 1 1 0 2 2
Shift C 0 0 0 2 2 2 0 1 1 1 0 0 0 2 2 0 0 1 1 1 1 0 2 2 0 0 0 0 0 0 0
Shift D 2 0 0 1 1 1 1 0 2 2 0 0 0 0 0 0 0 2 2 2 0 1 1 1 0 0 0 2 2 0 0
February 2019 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 nan nan nan
Shift A 0 0 0 2 2 0 0 1 1 1 1 0 2 2 0 0 0 0 0 0 0 2 2 2 0 1 1 1 nan nan nan
Shift B 0 0 0 0 0 0 0 2 2 2 0 1 1 1 0 0 0 2 2 0 0 1 1 1 1 0 2 2 nan nan nan
Shift C 2 2 2 0 1 1 1 0 0 0 2 2 0 0 1 1 1 1 0 2 2 0 0 0 0 0 0 0 nan nan nan
Shift D 1 1 1 1 0 2 2 0 0 0 0 0 0 0 2 2 2 0 1 1 1 0 0 0 2 2 0 0 nan nan nan
Where 1 represents Day shift (06:00 - 18:00), 2 represents Night shift (18:00 - 06:00) and 0 can be ignored. Only a single shift team will be working for a given period.
I need the data in a format where the data is indexed by the DateTime stamp with the current working shift, e.g. :
DateTime Shift
0 2019-01-01 06:00:00 A
1 2019-01-01 18:00:00 D
2 2019-01-02 06:00:00 A
3 2019-01-02 18:00:00 B
4 2019-01-03 06:00:00 A
5 2019-01-03 18:00:00 B
.
.
.
What would be the most efficient Pandas method to re-index the data to achieve this, i.e. avoiding for loops?
One can reindex a single column or multiple columns by using reindex() method and by specifying the axis we want to reindex. Default values in the new index that are not present in the dataframe are assigned NaN.
shift() If you want to shift your column or subtract the column value with the previous row value from the DataFrame, you can do it by using the shift() function. It consists of a scalar parameter called period, which is responsible for showing the number of shifts to be made over the desired axis.
The reindex() method allows you to change the row indexes, and the columns labels. ;] Note: The values are set to NaN if the new index is not the same as the old.
Use:
#get first column by position
first = df.iloc[:, 0]
#convert column to datetimes with missing values for no datetimes values
dates = pd.to_datetime(first, errors='coerce')
#mask for data row
mask = dates.isna()
#forward filling missing values and replace first NaNs by first column name
df.index = dates.ffill().fillna(pd.to_datetime(first.name))
#filter out rows with datetimes in first column, add first column to index
df = df[mask.values].set_index(first.name, append=True)
#convert columns names to timedeltas in days, first is 0 days
df.columns = pd.to_timedelta(df.columns.astype(int) - 1, unit='D')
#dictionary for map 1, 2 values
mapp = {1: pd.Timedelta('06:00:00'), 2:pd.Timedelta('18:00:00')}
#remove 0 rows with convert to NaN by mask and reshape by stack
#map by dict and convert MultiIndex to columns
df = (df.mask(df == 0)
.stack()
.map(mapp)
.rename_axis(('Datetime','Shift', 'day'))
.reset_index(name='td')
)
#add days to hours and add to Datetime
df['Datetime'] += (df.pop('td') + df.pop('day'))
#sorting ans create default index
df = df.sort_values(['Datetime','Shift']).reset_index(drop=True)
print (df)
Datetime Shift
0 2019-01-01 06:00:00 Shift A
1 2019-01-01 18:00:00 Shift D
2 2019-01-02 06:00:00 Shift A
3 2019-01-02 18:00:00 Shift B
4 2019-01-03 06:00:00 Shift A
.. ... ...
113 2019-02-26 18:00:00 Shift D
114 2019-02-27 06:00:00 Shift A
115 2019-02-27 18:00:00 Shift B
116 2019-02-28 06:00:00 Shift A
117 2019-02-28 18:00:00 Shift B
[118 rows x 2 columns]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With