I have a pandas DataFrame with one year of minute-level forex data (371,635 rows):
                           O        H        L        C
0                                                      
2017-01-02 02:00:00  1.05155  1.05197  1.05155  1.05190
2017-01-02 02:01:00  1.05209  1.05209  1.05177  1.05179
2017-01-02 02:02:00  1.05177  1.05198  1.05177  1.05178
2017-01-02 02:03:00  1.05188  1.05200  1.05188  1.05200
2017-01-02 02:04:00  1.05196  1.05204  1.05196  1.05203
I want to filter daily data to get an hour range:
dt = datetime(2017,1,1)
df_day = df[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')   
If I run this in a for loop over 200 days, it takes minutes. I suspect that at every step this line
df_day = df[df.index.date == dt.date()]
checks equality against every row in the data set (even though the data set is ordered).
Is there any way to speed up the filtering, or should I just write an old-fashioned imperative for loop from January to December...?
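Roughly, the loop I am running looks like this (a simplified sketch; the results list is just illustrative):

from datetime import datetime, timedelta

results = []
dt = datetime(2017, 1, 1)
for _ in range(200):
    df_day = df[df.index.date == dt.date()]            # scans every row on every iteration
    df_day_t = df_day.between_time('08:30', '09:30')
    results.append(df_day_t)
    dt += timedelta(days=1)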
datetime
First, you should avoid combining Python datetime objects with Pandas operations. There are many Pandas / NumPy friendly ways to create datetime-like objects for comparison, e.g. pd.Timestamp and pd.to_datetime. Your performance issues here are partly due to the behaviour described in the docs:
pd.Series.dt.date returns an array of python datetime.date objects
Using object dtype in this way removes vectorisation benefits, as operations then require Python-level loops.
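For example (a small sketch, not taken from the question verbatim), comparing a day-floored DatetimeIndex against a pd.Timestamp stays vectorised, whereas going through .date produces an object array and falls back to Python-level comparisons:

import pandas as pd

day = pd.Timestamp('2017-01-02')

# vectorised: compare Timestamps on a day-floored DatetimeIndex
mask_fast = df.index.floor('d') == day

# slow: df.index.date is an object array of datetime.date, compared element by element in Python
mask_slow = df.index.date == day.date()

df_day = df[mask_fast]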
groupby operations for aggregating by date
Pandas already has functionality to group by date via normalizing time:
for day, df_day in df.groupby(df.index.floor('d')):
    df_day_t = df_day.between_time('08:30', '09:30')
    # do something
As another example, you can access a slice for a particular day in this way:
g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)
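Putting the pieces together, one way to replace the original day-by-day loop could look like this (a sketch; parts and the final concat are just one way to collect the output):

import pandas as pd

parts = []
for day, df_day in df.groupby(df.index.floor('d')):
    # keep only the 08:30-09:30 window for this day
    parts.append(df_day.between_time('08:30', '09:30'))

# single frame with the hour range for every day in the data
df_windows = pd.concat(parts)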