
Filtering pandas dataframe by day

I have a pandas data frame with forex data by minutes, one year long (371635 rows):

                           O        H        L        C
0                                                      
2017-01-02 02:00:00  1.05155  1.05197  1.05155  1.05190
2017-01-02 02:01:00  1.05209  1.05209  1.05177  1.05179
2017-01-02 02:02:00  1.05177  1.05198  1.05177  1.05178
2017-01-02 02:03:00  1.05188  1.05200  1.05188  1.05200
2017-01-02 02:04:00  1.05196  1.05204  1.05196  1.05203

I want to filter each day's data down to an hour range:

from datetime import datetime

dt = datetime(2017, 1, 1)
df_day = df[df.index.date == dt.date()]
df_day_t = df_day.between_time('08:30', '09:30')

If I do a for loop with 200 days, it takes minutes. I suspect that at every step this line

df_day = df[df.index.date == dt.date()]

is comparing against every row in the data set (even though the data set is ordered).

Is there any way to speed up the filtering, or should I just fall back to a plain imperative for loop from January to December?
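One option worth noting: when the index is a DatetimeIndex, pandas supports partial-string indexing, so a whole day can be selected with df.loc['2017-01-02'] without an element-wise comparison. A minimal sketch with a toy minute-frequency frame (the column name and dates are illustrative, not from the original data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the forex frame: 5 minutes of data on one day.
idx = pd.date_range('2017-01-02 02:00', periods=5, freq='min')
df = pd.DataFrame({'O': np.arange(5, dtype=float)}, index=idx)

# Partial-string indexing selects the whole day in one vectorised lookup.
df_day = df.loc['2017-01-02']

# Then narrow to a time-of-day window as in the question.
df_day_t = df_day.between_time('02:01', '02:03')
```

This avoids building an object-dtype date array on every loop iteration, which is where the slowdown comes from (see the answer below).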

Asked Nov 09 '18 by Stefano Piovesan


1 Answer

Avoid Python datetime

First, avoid mixing Python datetime objects with Pandas operations. Pandas and NumPy provide vectorisation-friendly ways to create datetime-like objects for comparison, e.g. pd.Timestamp and pd.to_datetime. Your performance issue here is partly due to this behaviour described in the docs:

pd.Series.dt.date returns an array of python datetime.date objects

Using object dtype in this way removes vectorisation benefits, as operations then require Python-level loops.
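To illustrate the difference, here is a small sketch (toy data, not the original frame) contrasting the object-dtype comparison from the question with a comparison that stays in datetime64 via normalize():

```python
import numpy as np
import pandas as pd

# Toy index spanning two days at 12-hour steps.
idx = pd.date_range('2017-01-01', periods=4, freq='12h')

# Object-dtype path: .date materialises Python date objects,
# so the == comparison runs element by element at Python level.
slow_mask = idx.date == pd.Timestamp('2017-01-01').date()

# Vectorised path: normalize() floors times to midnight while
# keeping datetime64 dtype, so == is a NumPy-level comparison.
fast_mask = idx.normalize() == pd.Timestamp('2017-01-01')
```

Both masks select the same rows; only the second avoids the Python-level loop, which is what matters over 371635 rows.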

Use groupby operations for aggregating by date

Pandas already has functionality to group by date via normalizing time:

for day, df_day in df.groupby(df.index.floor('d')):
    df_day_t = df_day.between_time('08:30', '09:30')
    # do something

As another example, you can access a slice for a particular day in this way:

g = df.groupby(df.index.floor('d'))
my_day = pd.Timestamp('2017-01-01')
df_slice = g.get_group(my_day)
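Since between_time also works on the whole frame at once, another way to structure this (a sketch, with made-up toy data and an assumed aggregation) is to slice the hour window a single time and then group the result by day:

```python
import numpy as np
import pandas as pd

# Toy minute-frequency frame covering three days.
idx = pd.date_range('2017-01-01', '2017-01-03 23:59', freq='min')
df = pd.DataFrame({'C': np.arange(len(idx), dtype=float)}, index=idx)

# Take the 08:30-09:30 window once for the whole frame,
# then group the surviving rows by calendar day.
windows = df.between_time('08:30', '09:30').groupby(lambda ts: ts.floor('d'))

# Example aggregation per day (mean of close prices here).
daily_means = windows['C'].mean()
```

This replaces 200 per-day filters with one vectorised slice followed by a single groupby.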
Answered Oct 16 '22 by jpp