I have a pandas dataframe like this:
Out[110]:
Time
2014-09-19 21:59:14 55.975
2014-09-19 21:56:08 55.925
2014-09-19 21:53:05 55.950
2014-09-19 21:50:29 55.950
2014-09-19 21:50:03 55.925
2014-09-19 21:47:00 56.150
2014-09-19 21:53:57 56.225
2014-09-19 21:40:51 56.225
2014-09-19 21:37:50 56.300
2014-09-19 21:34:46 56.300
2014-09-19 21:31:41 56.350
2014-09-19 21:30:08 56.500
2014-09-19 21:28:39 56.375
2014-09-19 21:25:34 56.350
2014-09-19 21:22:32 56.400
2014-09-19 21:19:27 56.325
2014-09-19 21:16:25 56.325
2014-09-19 21:13:21 56.350
2014-09-19 21:10:18 56.425
2014-09-19 21:07:13 56.475
Name: Spread, dtype: float64
which extends over long time periods (months to years), with very many observations for each day. For each day, I want to retrieve the observation closest to a specific time, say 16:00.
My approach so far has been
eodsearch = pd.DataFrame(df['Date'] + datetime.timedelta(hours=16))
eod = df.iloc[df.index.get_loc(eodsearch['Date'] ,method='nearest')]
which currently gives me the error
"Cannot convert input [Time Date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp"
Moreover I saw that get_loc also accepted tolerance as an input so if I could set tolerance to say 30 min that would be great as well.
Any advice on why my code fails or how to fix it?
from pandas.tseries.offsets import Hour
df.sort_index(inplace=True) # Sort indices of original DF if not in sorted order
# Create a lookup dataframe holding one 16:00 timestamp per unique date
d = pd.DataFrame(dict(Time=pd.unique(df.index.date) + Hour(16)))
(i) Use reindex, which with method='nearest' looks up observations in both directions (before and after the lookup time):
# Find values in original within +/- 30 minute interval of lookup
df.reindex(d['Time'], method='nearest', tolerance=pd.Timedelta('30Min'))
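One side effect of reindex is that the result is labelled by the lookup times, so the timestamp of the observation that was actually matched is not visible. Below is a minimal sketch of an alternative, assuming a sorted DatetimeIndex without duplicates: Index.get_indexer takes the same method and tolerance arguments and returns row positions, with -1 where nothing lies within the tolerance. The sample data, the deliberate gap on the second day, and the names targets, pos and closest are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one observation every 20 minutes over three days,
# at :07, :27 and :47 past each hour so that 16:00 is never hit exactly.
idx = pd.date_range("2017-01-01 00:07", periods=200, freq="20min", name="Time")
rng = np.random.default_rng(42)
df = pd.DataFrame({"observation": rng.uniform(50, 60, len(idx))}, index=idx)

# Remove everything between 15:00 and 17:00 on the second day, so that day
# has no observation within 30 minutes of 16:00.
gap = (df.index >= "2017-01-02 15:00") & (df.index <= "2017-01-02 17:00")
df = df[~gap].sort_index()

# One 16:00 target per calendar day present in the data.
targets = df.index.normalize().unique() + pd.Timedelta(hours=16)

# Position of the nearest row for each target; -1 if none is within tolerance.
pos = df.index.get_indexer(targets, method="nearest",
                           tolerance=pd.Timedelta("30min"))

# Keep only the targets that found a match; the index shows the real timestamps.
closest = df.iloc[pos[pos >= 0]]
print(closest)
```

As far as I know, the method and tolerance arguments of Index.get_loc were deprecated in pandas 1.4 and removed in 2.0, so get_indexer is also the route to take on recent versions.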
(ii) Use merge_asof against the lookup frame of unique dates built above. By default it searches backward only, i.e. it matches the last observation at or before the lookup time:
# Find values in original within 30 minute interval of lookup (backwards)
pd.merge_asof(d, df.reset_index(), on='Time', tolerance=pd.Timedelta('30Min'))
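If the closest observation on either side is wanted rather than the last one before the lookup time, merge_asof also accepts direction='nearest'. The sketch below assumes made-up sample data and renames the observation timestamp column to a hypothetical ObsTime so that both the lookup time and the matched time remain visible in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: observations at :07, :27 and :47 past each hour.
idx = pd.date_range("2017-01-01 00:07", periods=200, freq="20min", name="Time")
rng = np.random.default_rng(0)
df = pd.DataFrame({"observation": rng.uniform(50, 60, len(idx))}, index=idx)

# Right-hand side: observations with their timestamp as a regular column.
# merge_asof requires both key columns to be sorted.
right = df.sort_index().reset_index().rename(columns={"Time": "ObsTime"})

# Left-hand side: one 16:00 lookup time per calendar day.
d = pd.DataFrame({"Time": df.index.normalize().unique() + pd.Timedelta(hours=16)})

# Match each lookup time to the nearest observation within +/- 30 minutes.
eod = pd.merge_asof(d, right, left_on="Time", right_on="ObsTime",
                    direction="nearest", tolerance=pd.Timedelta("30min"))
print(eod)
```

Lookup times with no observation inside the tolerance come back with missing values in the ObsTime and observation columns instead of being dropped.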
(iii) To obtain all observations within a +/- 30 minute window of the target time:
Index.get_loc operates on a single label, so an entire Series cannot be passed to it directly; that is why the original code fails.
Instead, DatetimeIndex.indexer_between_time, which returns the positions of all rows whose time of day lies between the given start_time and end_time on every day, is more suitable for this purpose. (Both endpoints are inclusive by default.)
# Tolerance of +/- 30 minutes from 16:00:00
df.iloc[df.index.indexer_between_time("15:30:00", "16:30:00")]
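The window above returns every row between 15:30 and 16:30, so a day can contribute several rows. A possible follow-up, sketched here on made-up data, reduces the window to the single closest observation per day. The column name distance_s and the variables window and closest are hypothetical helpers, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: observations at :07, :27 and :47 past each hour.
idx = pd.date_range("2017-01-01 00:07", periods=200, freq="20min", name="Time")
rng = np.random.default_rng(1)
df = pd.DataFrame({"observation": rng.uniform(50, 60, len(idx))}, index=idx)

# Rows whose time of day lies between 15:30 and 16:30, on any day.
window = df.iloc[df.index.indexer_between_time("15:30", "16:30")].copy()

# Absolute distance in seconds from 16:00 on the same calendar day.
target = window.index.normalize() + pd.Timedelta(hours=16)
window["distance_s"] = np.abs((window.index - target).total_seconds())

# For each day, take the label of the row with the smallest distance.
days = window.index.normalize()
closest = window.loc[window.groupby(days)["distance_s"].idxmin()]
print(closest)
```

Days without any observation in the window simply do not appear in the result, which plays the same role as the tolerance in the other approaches.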
Data used to arrive at the result:
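import numpy as np
import pandas as pd
# 200 observations at 20-minute intervals ('20min' replaces the deprecated '20T' alias)
idx = pd.date_range('1/1/2017', periods=200, freq='20min', name='Time')
np.random.seed(42)
df = pd.DataFrame(dict(observation=np.random.uniform(50,60,200)), idx)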
# Shuffle indices
df = df.sample(frac=1., random_state=42)
Info:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 200 entries, 2017-01-02 07:40:00 to 2017-01-02 10:00:00
Data columns (total 1 columns):
observation 200 non-null float64
dtypes: float64(1)
memory usage: 3.1 KB