I'm working with a data set containing information on a phenomenon occurring during some time frames. I am given the start and end time of each event and its severity, as well as some other information. I would like to expand these frames over a larger time period at a fixed (hourly) frequency: each event should become one row per hour between its start and end, and the hours in which no event occurs should have NaNs for the rest of the information.
Data set example:
                               date_end  severity  category
date_start
2018-01-04 07:00:00 2018-01-04 10:00:00        12         1
2018-01-04 12:00:00 2018-01-04 13:00:00        44         2
What I want is:
                     severity  category
date_start
2018-01-04 07:00:00        12         1
2018-01-04 08:00:00        12         1
2018-01-04 09:00:00        12         1
2018-01-04 10:00:00        12         1
2018-01-04 11:00:00       NaN       NaN
2018-01-04 12:00:00        44         2
2018-01-04 13:00:00        44         2
2018-01-04 14:00:00       NaN       NaN
2018-01-04 15:00:00       NaN       NaN
What would be an efficient way of achieving such a result?
Assuming you are on pandas v0.25 or later, use explode:
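import pandas as pd

# For each event, build an hourly range from date_start (the index, row.name) to date_end.
# freq='H' matches pandas 0.25; pandas 2.2+ deprecates it in favor of 'h'.
df['hour'] = df.apply(lambda row: pd.date_range(row.name, row['date_end'], freq='H'), axis=1)
# Explode to one row per hour, then make the hourly timestamp the new index.
df = df.explode('hour').reset_index() \
    .drop(columns=['date_start', 'date_end']) \
    .rename(columns={'hour': 'date_start'}) \
    .set_index('date_start')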
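To make the explode step easier to try out, here is a minimal self-contained sketch on the two sample rows from the question. It is a variation on the snippet above, not a replacement: the names `expanded` and the list-comprehension construction of the `hour` column are my own choices for illustration, and the explicit `pd.to_datetime` call is an assumption-driven safeguard because `explode` returns an object-dtype column.

```python
import pandas as pd

# Sample data mirroring the question; date_start is the index.
df = pd.DataFrame(
    {
        "date_start": pd.to_datetime(["2018-01-04 07:00:00", "2018-01-04 12:00:00"]),
        "date_end": pd.to_datetime(["2018-01-04 10:00:00", "2018-01-04 13:00:00"]),
        "severity": [12, 44],
        "category": [1, 2],
    }
).set_index("date_start")

# One list of hourly timestamps per event (both endpoints included).
# "h" is the hourly alias in recent pandas; older releases spelled it "H".
df["hour"] = [
    list(pd.date_range(start, end, freq="h"))
    for start, end in zip(df.index, df["date_end"])
]

# One row per hour; the original start/end columns are no longer needed.
expanded = (
    df.explode("hour")
    .reset_index(drop=True)
    .drop(columns="date_end")
    .rename(columns={"hour": "date_start"})
)

# explode yields an object column, so convert back to datetime64 before indexing.
expanded["date_start"] = pd.to_datetime(expanded["date_start"])
expanded = expanded.set_index("date_start")

print(expanded)
```

At this point the frame only contains hours covered by an event (07:00 to 10:00 and 12:00 to 13:00); the gaps still have to be filled by reindexing.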
For the rows with NaN, you can reindex your DataFrame:
# Report from Jan 4 - 5, 2018, from 7AM - 7PM
days = pd.date_range('2018-01-04', '2018-01-05')
hours = pd.to_timedelta(range(7, 20), unit='h')
tmp = pd.MultiIndex.from_product([days, hours], names=['Date', 'Hour']).to_frame()
s = tmp['Date'] + tmp['Hour']
# reindex returns a new frame; timestamps with no event are filled with NaN
df = df.reindex(s).rename_axis('date_start')
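If the reporting window is a single continuous stretch, as in the desired output of the question, a plain `pd.date_range` may be enough as the reindex target. The sketch below assumes a window of 07:00 to 15:00 on 2018-01-04 (taken from the question's expected output) and hard-codes the already-expanded frame so that it runs on its own; `expanded`, `full_range` and `result` are illustrative names.

```python
import pandas as pd

# Frame after the explode step, hard-coded to keep the example self-contained.
expanded = pd.DataFrame(
    {"severity": [12, 12, 12, 12, 44, 44], "category": [1, 1, 1, 1, 2, 2]},
    index=pd.DatetimeIndex(
        [
            "2018-01-04 07:00", "2018-01-04 08:00", "2018-01-04 09:00",
            "2018-01-04 10:00", "2018-01-04 12:00", "2018-01-04 13:00",
        ],
        name="date_start",
    ),
)

# Assumed reporting window, chosen to match the question's desired output.
# "h" is the hourly alias in recent pandas; older releases spelled it "H".
full_range = pd.date_range(
    "2018-01-04 07:00", "2018-01-04 15:00", freq="h", name="date_start"
)

# Hours without an event (11:00, 14:00, 15:00) become all-NaN rows.
result = expanded.reindex(full_range)

print(result)
```

Note that severity and category come back as floats after reindexing, because NaN cannot be stored in a plain integer column.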