I am trying to analyze average daily fluctuations in a measurement "X" over several weeks using pandas dataframes, however timestamps/datetimes etc. are proving particularly hellish to deal with. Having spent a good few hours trying to work this out my code is getting messier and messier and I don't think I'm any closer to a solution, hoping someone here can guide me in the right direction.
I have measured X at different times and on different days, saving the daily results to a dataframe which has the form:
Timestamp(datetime64) X
0 2015-10-05 00:01:38 1
1 2015-10-05 06:03:39 4
2 2015-10-05 13:42:39 3
3 2015-10-05 22:15:39 2
As the time the measurement is made at changes from day to day I decided to use binning to organize the data, and then work out averages and STD for each bin which I can then plot. My idea was to create a final dataframe with bins and the average value of X for the measurements, the 'Observations' column is just to aid understanding:
Time Bin Observations <X>
0 00:00-05:59 [ 1 , ...] 2.3
1 06:00-11:59 [ 4 , ...] 4.6
2 12:00-17:59 [ 3 , ...] 8.5
3 18:00-23:59 [ 2 , ...] 3.1
However I've run into difficulties with incompatibility between time, datetime, datetime64, timedelta and binning using pd.cut
and pd.groupby
, basically I feel like I'm making stabs in the dark with no idea as to the the 'right' way to approach this problem. The only solution I can think of is a row-by-row iteration through the dataframe but I'd really like to avoid having to do this.
To remove timezone from tz-aware DatetimeIndex , use tz_localize(None) or tz_convert(None) . tz_localize(None) will remove timezone holding local time representations. tz_convert(None) will remove timezone after converting to UTC time.
If your DataFrame holds the DateTime in a string column in a specific format, you can convert it by using to_datetime() function as it accepts the format param to specify the format date & time.
pandas.DataFrame
is to use pandas.cut
datetime
format with pandas.to_datetime
..dt.hour
to extract the hour, for use in the .cut
method.python 3.8.11
and pandas 1.3.1
bin
the dataimport pandas as pd
import numpy as np # for test data
import random # for test data
# setup a sample dataframe; creates 1.5 months of hourly observations
np.random.seed(365)
random.seed(365)
data = {'date': pd.bdate_range('2020-09-21', freq='h', periods=1100).tolist(),
'x': np.random.randint(10, size=(1100))}
df = pd.DataFrame(data)
# the date column of the sample data is already in a datetime format
# if the date column is not a datetime, then uncomment the following line
# df.date= pd.to_datetime(df.date)
# define the bins
bins = [0, 6, 12, 18, 24]
# add custom labels if desired
labels = ['00:00-05:59', '06:00-11:59', '12:00-17:59', '18:00-23:59']
# add the bins to the dataframe
df['Time Bin'] = pd.cut(df.date.dt.hour, bins, labels=labels, right=False)
# display(df.head())
date x Time Bin
0 2020-09-21 00:00:00 2 00:00-05:59
1 2020-09-21 01:00:00 4 00:00-05:59
2 2020-09-21 02:00:00 1 00:00-05:59
3 2020-09-21 03:00:00 5 00:00-05:59
4 2020-09-21 04:00:00 2 00:00-05:59
# display(df.tail())
date x Time Bin
1095 2020-11-05 15:00:00 2 12:00-17:59
1096 2020-11-05 16:00:00 3 12:00-17:59
1097 2020-11-05 17:00:00 1 12:00-17:59
1098 2020-11-05 18:00:00 2 18:00-23:59
1099 2020-11-05 19:00:00 2 18:00-23:59
'Time Bin'
pandas.DataFrame.groupby
on 'Time Bin'
, and then aggregate 'x'
into a list
and mean
.# groupby Time Bin and aggregate a list for the observations, and mean
dfg = df.groupby('Time Bin', as_index=False)['x'].agg([list, 'mean'])
# change the column names, if desired
dfg.columns = ['X Observations', 'X mean']
# display(dfg)
X Observations X mean
Time Bin
00:00-05:59 [2, 4, 1, 5, 2, 2, ...] 4.416667
06:00-11:59 [9, 8, 4, 0, 3, 3, ...] 4.760870
12:00-17:59 [7, 7, 7, 0, 8, 4, ...] 4.384058
18:00-23:59 [3, 2, 6, 2, 6, 8, ...] 4.459559
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With