
Find daily observation closest to specific time for irregularly spaced data

I have a pandas time series like this:

Out[110]:
Time
2014-09-19 21:59:14    55.975
2014-09-19 21:56:08    55.925
2014-09-19 21:53:05    55.950
2014-09-19 21:50:29    55.950
2014-09-19 21:50:03    55.925
2014-09-19 21:47:00    56.150
2014-09-19 21:53:57    56.225
2014-09-19 21:40:51    56.225
2014-09-19 21:37:50    56.300
2014-09-19 21:34:46    56.300
2014-09-19 21:31:41    56.350
2014-09-19 21:30:08    56.500
2014-09-19 21:28:39    56.375
2014-09-19 21:25:34    56.350
2014-09-19 21:22:32    56.400
2014-09-19 21:19:27    56.325
2014-09-19 21:16:25    56.325
2014-09-19 21:13:21    56.350
2014-09-19 21:10:18    56.425
2014-09-19 21:07:13    56.475
Name: Spread, dtype: float64

The data extends over long time periods (months to years), so there are very many observations for each day. For each day, I want to retrieve the observation closest to a specific time, say 16:00.

My approach so far has been:

eodsearch = pd.DataFrame(df['Date'] + datetime.timedelta(hours=16))

eod = df.iloc[df.index.get_loc(eodsearch['Date'] ,method='nearest')]

which currently gives me this error:

Cannot convert input [Time Date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp

Moreover, I saw that get_loc also accepts tolerance as an input, so it would be great if I could set the tolerance to, say, 30 minutes.

Any advice on why my code fails or how to fix it?

thevaluebay asked Feb 13 '17 15:02


1 Answer

Preparing data:

import pandas as pd
from pandas.tseries.offsets import Hour

df.sort_index(inplace=True)  # Sort the index of the original DF if it is not already sorted
# Create a lookup DataFrame with one timestamp per unique date, offset by 16 hours (16:00)
d = pd.DataFrame(dict(Time=pd.unique(df.index.date) + Hour(16)))

(i) Use reindex, which looks for the nearest observation in both directions (before or after the lookup time):

# Find values in original within +/- 30 minute interval of lookup 
df.reindex(d['Time'], method='nearest', tolerance=pd.Timedelta('30Min'))
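Note that reindex labels each result row with the lookup time (16:00), not with the time of the observation it found. If the actual observation timestamps are needed, one option is Index.get_indexer, which accepts a whole array of targets together with method and tolerance. The following is only a sketch on made-up toy data (the names toy, targets, pos and closest are illustrative), and it assumes a sorted index without duplicates:

```python
import pandas as pd

# Hypothetical toy data: three days of irregularly spaced observations
idx = pd.DatetimeIndex(
    [
        "2017-01-01 15:50",
        "2017-01-01 16:20",
        "2017-01-02 12:00",
        "2017-01-02 16:05",
        "2017-01-03 09:00",
    ],
    name="Time",
)
toy = pd.DataFrame({"observation": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx).sort_index()

# One 16:00 lookup timestamp per calendar day present in the data
targets = toy.index.normalize().unique() + pd.Timedelta(hours=16)

# Integer positions of the nearest rows; -1 marks a target with no row within tolerance
pos = toy.index.get_indexer(targets, method="nearest", tolerance=pd.Timedelta("30min"))

# Keep only the matched positions; the result keeps the original observation timestamps
closest = toy.iloc[pos[pos >= 0]]
print(closest)
```

Here the third day has no observation within 30 minutes of 16:00, so it is dropped rather than returned as a NaN row.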


(ii) Use merge_asof after identifying the unique dates in the original DF. By default it searches backward only, i.e. it picks the last observation at or before the lookup time:

# Find values in original within 30 minute interval of lookup (backwards)
pd.merge_asof(d, df.reset_index(), on='Time', tolerance=pd.Timedelta('30Min'))
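merge_asof also takes a direction argument ('backward', 'forward' or 'nearest'), so the backward-only behaviour is a default rather than a limitation. A small sketch on made-up data comparing the two; the frames obs and lookup and the extra obs_time column are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

# Hypothetical observations, already sorted by time (merge_asof requires sorted keys)
obs = pd.DataFrame(
    {
        "Time": pd.to_datetime(
            ["2017-01-01 15:50", "2017-01-01 16:05", "2017-01-02 15:45"]
        ),
        "observation": [1.0, 2.0, 3.0],
    }
)
# Copy the timestamp, because the key column of the result holds the lookup time
obs["obs_time"] = obs["Time"]

# One 16:00 lookup timestamp per day
lookup = pd.DataFrame({"Time": pd.to_datetime(["2017-01-01 16:00", "2017-01-02 16:00"])})

tol = pd.Timedelta("30min")

# Default: last observation at or before each lookup time
backward = pd.merge_asof(lookup, obs, on="Time", tolerance=tol)

# Closest observation on either side of each lookup time
nearest = pd.merge_asof(lookup, obs, on="Time", tolerance=tol, direction="nearest")

print(backward)
print(nearest)
```

On the first day the backward search picks the 15:50 observation, while the nearest search picks 16:05, which is closer to 16:00.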


(iii) To obtain all rows that fall within a +/- 30 minute window around the lookup time on each day:

Index.get_loc operates on a single label, so an entire Series object cannot be passed to it directly.

Instead, DatetimeIndex.indexer_between_time is more suitable for this purpose: for every day, it returns the positions of all rows whose time of day lies between the specified start_time and end_time. (Both endpoints are inclusive by default.)


# Tolerance of +/- 30 minutes from 16:00:00
df.iloc[df.index.indexer_between_time("15:30:00", "16:30:00")]
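The window above can contain several rows per day. If only the single closest row per day is wanted, one possible follow-up is to compute each row's distance from 16:00 and take the per-day minimum. This is a sketch on made-up toy data, with illustrative names (toy, window, distance, closest), not the answer's own method:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two days, with one row outside the 15:30-16:30 window
idx = pd.DatetimeIndex(
    [
        "2017-01-01 15:40",
        "2017-01-01 16:10",
        "2017-01-02 15:55",
        "2017-01-02 16:25",
        "2017-01-02 18:00",
    ],
    name="Time",
)
toy = pd.DataFrame({"observation": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# All rows whose time of day lies within +/- 30 minutes of 16:00
window = toy.iloc[toy.index.indexer_between_time("15:30:00", "16:30:00")]

# Absolute distance (in seconds) of each row from 16:00 on its own day
target = window.index.normalize() + pd.Timedelta(hours=16)
seconds_off = np.asarray((window.index - target).total_seconds())
distance = pd.Series(seconds_off, index=window.index).abs()

# For each calendar day, the index label of the row with the smallest distance
closest_labels = distance.groupby(window.index.normalize()).idxmin()
closest = window.loc[closest_labels.to_numpy()]
print(closest)
```

Days with no observation inside the window simply do not appear in the result.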

Data used to arrive at the result:

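import numpy as np
import pandas as pd

# 200 observations at 20-minute intervals ('20min' is the non-deprecated spelling of '20T')
idx = pd.date_range('1/1/2017', periods=200, freq='20min', name='Time')
np.random.seed(42)
df = pd.DataFrame(dict(observation=np.random.uniform(50,60,200)), idx)
# Shuffle the rows so the index is no longer sorted
df = df.sample(frac=1., random_state=42)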

Info:

df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 200 entries, 2017-01-02 07:40:00 to 2017-01-02 10:00:00
Data columns (total 1 columns):
observation    200 non-null float64
dtypes: float64(1)
memory usage: 3.1 KB
Nickil Maveli answered Nov 08 '22 08:11