Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas dataframe: omit weekends and days near holidays

Tags:

python

pandas

I have a Pandas dataframe with a DataTimeIndex and some other columns, similar to this:

import pandas as pd
import numpy as np

range = pd.date_range('2017-12-01', '2018-01-05', freq='6H')
df = pd.DataFrame(index = range)

# Average speed in miles per hour
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))

df.info()
# DatetimeIndex: 141 entries, 2017-12-01 00:00:00 to 2018-01-05 00:00:00
# Freq: 6H
# Data columns (total 1 columns):
# value    141 non-null int64
# dtypes: int64(1)
# memory usage: 2.2 KB

df.head(10)
#                      value
# 2017-12-01 00:00:00     15
# 2017-12-01 06:00:00     54
# 2017-12-01 12:00:00     19
# 2017-12-01 18:00:00     13
# 2017-12-02 00:00:00     35
# 2017-12-02 06:00:00     31
# 2017-12-02 12:00:00     58
# 2017-12-02 18:00:00      6
# 2017-12-03 00:00:00      8
# 2017-12-03 06:00:00     30

How can I select or filter the entries that are:

  1. Weekdays only (that is, not weekend days Saturday or Sunday)

  2. Not within N days of the dates in a list (e.g. U.S. holidays like '12-25' or '01-01')?

I was hoping for something like:

df = exclude_Sat_and_Sun(df)

omit_days = ['12-25', '01-01']
N = 3 # days near the holidays
df = exclude_days_near_omit_days(N, omit_days)

I was thinking of creating a new column to break out the month and day and then comparing them to the criteria for 1 and 2 above. However, I was hoping for something more Pythonic using the DateTimeIndex.

Thanks for any help.

like image 779
stackoverflowuser2010 Avatar asked Feb 16 '18 00:02

stackoverflowuser2010


2 Answers

The first part can be easily accomplished using the Pandas DatetimeIndex.dayofweek property, which starts counting weekdays with Monday as 0 and ending with Sunday as 6.

df[df.index.dayofweek < 5] will give you only the weekdays.


For the second part you can use the datetime module. Below I will give an example for only one date, namely 2017-12-25. You can easily generalize it to a list of dates, for example by defining a helper function.

from datetime import datetime, timedelta

N = 3

df[abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N)]

This will give all dates that are more than N=3 days away from 2017-12-25. That is, it will exclude an interval of 7 days from 2017-12-22 to 2017-12-28.


Lastly, you can combine the two criteria using the & operator, as you probably know.

df[
   (df.index.dayofweek < 5) 
   & 
   (abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N))
  ]
like image 105
Bahman Engheta Avatar answered Sep 22 '22 12:09

Bahman Engheta


I followed the answer by @Bahman Engheta and created a function to omit dates from a dataframe.

import pandas as pd
from datetime import datetime, timedelta

def omit_dates(df, list_years, list_dates, omit_days_near=3, omit_weekends=False):
    '''
    Given a Pandas dataframe with a DatetimeIndex, remove rows that have a date
    near a given list of dates and/or a date on a weekend.

    Parameters:
    ----------

    df : Pandas dataframe

    list_years : list of str
        Contains a list of years in string form
    list_dates : list of str
        Contains a list of dates in string form encoded as MM-DD
    omit_days_near : int
        Threshold of days away from list_dates to remove. For example, if
        omit_days_near=3, then omit all days that are 3 days away from 
        any date in list_dates.
    omit_weekends : bool
        If true, omit dates that are on weekends.

    Returns:
    -------
    Pandas dataframe
        New resulting dataframe with dates omitted.
    '''

    if not isinstance(df, pd.core.frame.DataFrame):
        raise ValueError("df is expected to be a Pandas dataframe, not %s" % type(df).__name__)

    if not isinstance(df.index, pd.tseries.index.DatetimeIndex):
        raise ValueError("Dataframe is expected to have an index of DateTimeIndex, not %s" %
                         type(df.index).__name__)

    if not isinstance(list_years, list):
        list_years = [list_years]

    if not isinstance(list_dates, list):
        list_dates = [list_dates]

    result = df.copy()

    if omit_weekends:
        result = result.loc[result.index.dayofweek < 5]

    omit_dates = [ '%s-%s' % (year, date) for year in list_years for date in list_dates ]

    for date in omit_dates:
        result = result.loc[abs(result.index.date - datetime.strptime(date, '%Y-%m-%d').date()) > timedelta(omit_days_near)]

    return result

Here is example usage. Suppose you have a dataframe that has a DateTimeIndex and other columns, like this:

import pandas as pd
import numpy as np

range = pd.date_range('2017-12-01', '2018-01-05', freq='1D')
df = pd.DataFrame(index = range)

df['value'] = np.random.randint(low=0, high=60, size=len(df.index))

The resulting dataframe looks like this:

            value
2017-12-01     42
2017-12-02     35
2017-12-03     49
2017-12-04     25
2017-12-05     19
2017-12-06     28
2017-12-07     21
2017-12-08     57
2017-12-09      3
2017-12-10     57
2017-12-11     46
2017-12-12     20
2017-12-13      7
2017-12-14      5
2017-12-15     30
2017-12-16     57
2017-12-17      4
2017-12-18     46
2017-12-19     32
2017-12-20     48
2017-12-21     55
2017-12-22     52
2017-12-23     45
2017-12-24     34
2017-12-25     42
2017-12-26     33
2017-12-27     17
2017-12-28      2
2017-12-29      2
2017-12-30     51
2017-12-31     19
2018-01-01      6
2018-01-02     43
2018-01-03     11
2018-01-04     45
2018-01-05     45

Now, let's specify dates to remove. I want to remove the dates '12-10', '12-25', '12-31', and '01-01' (following MM-DD notation) and all dates within 2 days of those dates. Further, I want to remove those dates from both the years '2016' and '2017'. I also want to remove weekend dates.

I'll call my function like this:

years = ['2016', '2017']
holiday_dates = ['12-10', '12-25', '12-31', '01-01']
omit_dates(df, years, holiday_dates, omit_days_near=2, omit_weekends=True)

The result is:

            value
2017-12-01     42
2017-12-04     25
2017-12-05     19
2017-12-06     28
2017-12-07     21
2017-12-13      7
2017-12-14      5
2017-12-15     30
2017-12-18     46
2017-12-19     32
2017-12-20     48
2017-12-21     55
2017-12-22     52
2017-12-28      2
2018-01-03     11
2018-01-04     45
2018-01-05     45

Is that answer correct? Here are the calendars for December 2017 and January 2018:

   December 2017      
Su Mo Tu We Th Fr Sa  
                1  2  
 3  4  5  6  7  8  9  
10 11 12 13 14 15 16  
17 18 19 20 21 22 23  
24 25 26 27 28 29 30  
31   

    January 2018      
Su Mo Tu We Th Fr Sa  
    1  2  3  4  5  6  
 7  8  9 10 11 12 13  
14 15 16 17 18 19 20  
21 22 23 24 25 26 27  
28 29 30 31   

Looks like it works.

like image 42
stackoverflowuser2010 Avatar answered Sep 22 '22 12:09

stackoverflowuser2010