
Fast selection of a time interval in a pandas DataFrame/Series

My problem is that I want to filter a DataFrame to only include times within the interval [start, end). If I do not care about the day, I would like to filter only by start and end time for each day. I have a solution for this, but it is slow. So my question is whether there is a faster way to do the time-based filtering.

Example

import pandas as pd
import time


index = pd.date_range(start='2012-11-05 01:00:00', end='2012-11-05 23:00:00', freq='1s').tz_localize('UTC')
df = pd.DataFrame(range(len(index)), index=index, columns=['Number'])

# select from 1 to 2 am, include day
now = time.time()
df2 = df.loc['2012-11-05 01:00:00':'2012-11-05 02:00:00']  # .loc replaces the removed .ix indexer
print('Took %s seconds' % (time.time() - now))  # 0.0368609428406

# select from 1 to 2 am, for every day
now = time.time()
selector = (df.index.hour >= 1) & (df.index.hour < 2)  # boolean mask on the hour of each timestamp
df3 = df[selector]
print('Took %s seconds' % (time.time() - now))  # 0.0699911117554

As you can see, if I ignore the day (second case), it takes almost twice as long. The computation time increases rapidly if the index spans several days, e.g. from 5 to 7 Nov:

index = pd.date_range(start='2012-11-05 01:00:00', end='2012-11-07 23:00:00', freq='1s').tz_localize('UTC')

So, to summarize: is there a faster method to filter by time of day across many days?

Thx

Mannaggia asked Feb 02 '14 14:02


1 Answer

You need the between_time method.

In [14]: %timeit df.between_time(start_time='01:00', end_time='02:00')
100 loops, best of 3: 10.2 ms per loop

In [15]: %timeit selector=(df.index.hour>=1) & (df.index.hour<2); df[selector]
100 loops, best of 3: 18.2 ms per loop

I ran these tests with an index spanning 5 to 7 November.
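As a rough sanity check, the two approaches can be compared on the same 5 to 7 November index to confirm they select the same rows. This is only a sketch: the variable names `by_mask`, `by_between` and `same` are made up for illustration, and the end time `'01:59:59'` is chosen because the data has one-second resolution and `between_time` includes both endpoints by default.

```python
import pandas as pd

# Same shape of data as in the question: one row per second, 5 to 7 November, UTC.
index = pd.date_range(start='2012-11-05 01:00:00',
                      end='2012-11-07 23:00:00',
                      freq='1s', tz='UTC')
df = pd.DataFrame({'Number': range(len(index))}, index=index)

# Approach from the question: boolean mask on the hour, i.e. [01:00, 02:00).
mask = (df.index.hour >= 1) & (df.index.hour < 2)
by_mask = df[mask]

# Approach from the answer: between_time with both endpoints included,
# so the last second of the hour is used as the end time.
by_between = df.between_time('01:00', '01:59:59')

# Both selections should contain exactly the same rows.
same = by_mask.equals(by_between)
print(len(by_mask), len(by_between), same)
```

Timing either line with `%timeit`, as in the answer, will give numbers that depend on the machine and the pandas version.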

Documentation

Definition: df.between_time(self, start_time, end_time, include_start=True, include_end=True)
Docstring:
Select values between particular times of the day (e.g., 9:00-9:30 AM)

Parameters
----------
start_time : datetime.time or string
end_time : datetime.time or string
include_start : boolean, default True
include_end : boolean, default True

Returns
-------
values_between_time : type of caller
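Note that the question asks for the half-open interval [start, end), while the defaults above include both endpoints. The docstring shown is from the pandas version current at the time of the answer; the sketch below assumes pandas 1.4 or newer, where the `include_start`/`include_end` flags were replaced by a single `inclusive` argument. The 30-minute index and the names `half_open` and `closed` are illustrative only.

```python
import pandas as pd

# Small illustrative index: one row every 30 minutes over two days, UTC.
index = pd.date_range('2012-11-05 00:00', '2012-11-06 23:30',
                      freq='30min', tz='UTC')
df = pd.DataFrame({'Number': range(len(index))}, index=index)

# Half-open interval [01:00, 02:00): keep the start, drop the end.
# Assumes pandas >= 1.4, where `inclusive` accepts
# 'both', 'neither', 'left' or 'right'.
half_open = df.between_time('01:00', '02:00', inclusive='left')

# Default behaviour: both endpoints are included.
closed = df.between_time('01:00', '02:00')

print(half_open.index.strftime('%Y-%m-%d %H:%M').tolist())
print(closed.index.strftime('%Y-%m-%d %H:%M').tolist())
```

On older pandas versions, the equivalent call would presumably be `include_end=False`, matching the docstring above.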

Nipun Batra answered Sep 25 '22 03:09