my problem is that I want to filter a DataFrame to only include times within the interval [start, end) . If do not care about the day, I would like to filter only for start and end time for each day. I have a solution for this but it is slow. So my question is if there is a faster way to do the time based filtering.
Example
import pandas as pd
import time
index=pd.date_range(start='2012-11-05 01:00:00', end='2012-11-05 23:00:00', freq='1S').tz_localize('UTC')
df=pd.DataFrame(range(len(index)), index=index, columns=['Number'])
# select from 1 to 2 am, include day
now=time.time()
df2=df.ix['2012-11-05 01:00:00':'2012-11-05 02:00:00']
print 'Took %s seconds' %(time.time()-now) #0.0368609428406
# select from 1 to 2 am, for every day
now=time.time()
selector=(df.index.hour>=1) & (df.index.hour<2)
df3=df[selector]
print 'Took %s seconds' %(time.time()-now) #Took 0.0699911117554
As you can see if I remove the day (second case) it takes almost twice as much. The computation time increases rapidly if I have a number of different days, e.g from 5 to 7 Nov:
index=pd.date_range(start='2012-11-05 01:00:00', end='2012-11-07 23:00:00', freq='1S').tz_localize('UTC')
So, to summarize is there a faster method to filter by time of the day, across many days?
Thx
This solution also uses looping to get the job done, but apply has been optimized better than iterrows , which results in faster runtimes. See below for an example of how we could use apply for labeling the species in each row.
pandas contains extensive capabilities and features for working with time series data for all domains. Using the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other Python libraries like scikits.
Use pd. to_datetime to convert your strings to actual timestamps. You can then access parts of the datetime with e.g. df['Time']. dt.
You need between_time
method.
In [14]: %timeit df.between_time(start_time='01:00', end_time='02:00')
100 loops, best of 3: 10.2 ms per loop
In [15]: %timeit selector=(df.index.hour>=1) & (df.index.hour<2); df[selector]
100 loops, best of 3: 18.2 ms per loop
I had done these tests with 5th to 7th November as index.
Definition: df.between_time(self, start_time, end_time, include_start=True, include_end=True) Docstring: Select values between particular times of the day (e.g., 9:00-9:30 AM) Parameters ---------- start_time : datetime.time or string end_time : datetime.time or string include_start : boolean, default True include_end : boolean, default True Returns ------- values_between_time : type of caller
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With