
How to create a calendar table (date dimension) in pandas

Tags:

pandas

A calendar table (date dimension), with one row per date and an integer primary key, is sometimes used in database design. For example:

| date_id |     Date       |    Record_timestamp |  Day      |  Week |  Month |     Quarter |   Year_half |     Year |
|---------+----------------+---------------------+-----------+-------+--------+-------------+-------------+----------|
|       0 |     2000-01-01 |    NaN              |  Saturday |  52   |  1     |     1       |   1         |     2000 |
|       1 |     2000-01-02 |    NaN              |  Sunday   |  52   |  1     |     1       |   1         |     2000 |
|       2 |     2000-01-03 |    NaN              |  Monday   |  1    |  1     |     1       |   1         |     2000 |

How can I create such a table in pandas?

redacted asked Nov 07 '17 05:11



3 Answers

This is a little cleaner with the dt accessor:

In [10]: import pandas as pd

In [11]: def create_date_table2(start='2000-01-01', end='2050-12-31'):
    ...:     df = pd.DataFrame({"Date": pd.date_range(start, end)})
    ...:     # day_name() and isocalendar() replace the removed
    ...:     # weekday_name and weekofyear attributes
    ...:     df["Day"] = df.Date.dt.day_name()
    ...:     df["Week"] = df.Date.dt.isocalendar().week
    ...:     df["Quarter"] = df.Date.dt.quarter
    ...:     df["Year"] = df.Date.dt.year
    ...:     df["Year_half"] = (df.Quarter + 1) // 2
    ...:     return df

In [12]: create_date_table2().head()
Out[12]:
        Date        Day  Week  Quarter  Year  Year_half
0 2000-01-01   Saturday    52        1  2000          1
1 2000-01-02     Sunday    52        1  2000          1
2 2000-01-03     Monday     1        1  2000          1
3 2000-01-04    Tuesday     1        1  2000          1
4 2000-01-05  Wednesday     1        1  2000          1

In [13]: create_date_table2().tail()
Out[13]:
            Date        Day  Week  Quarter  Year  Year_half
18623 2050-12-27    Tuesday    52        4  2050          2
18624 2050-12-28  Wednesday    52        4  2050          2
18625 2050-12-29   Thursday    52        4  2050          2
18626 2050-12-30     Friday    52        4  2050          2
18627 2050-12-31   Saturday    52        4  2050          2

Note: you may like to calculate these on the fly rather than store them as columns!
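As a minimal sketch of the on-the-fly idea, the attributes can be derived from a datetime column at query time instead of being looked up in a stored table. The `orders` frame and its column names below are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical fact data; the frame and column names are illustrative only.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2000-01-01", "2000-01-03", "2000-07-15"]),
    "amount": [10.0, 20.0, 30.0],
})

# Derive the calendar attributes directly from the datetime column.
quarter = orders["order_date"].dt.quarter
year_half = (quarter + 1) // 2

# Group by the derived attribute without ever storing it as a column.
totals = orders.groupby(year_half.rename("Year_half"))["amount"].sum()
print(totals)
```

Whether this beats a stored calendar table depends on how often the attributes are needed and on whether the table also carries data that cannot be computed, such as holidays.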

Andy Hayden answered Oct 23 '22 08:10


I liked Andy's and Robin's approaches and modified their create_date_table functions slightly for my needs, in case you are interested in having a deterministic date_id. I find this helpful because later ETL processes can compute the key directly from a date (it is the date written as YYYYMMDD) without an extra look-up step.

import pandas as pd

def create_date_table3(start='1990-01-01', end='2080-12-31'):
    df = pd.DataFrame({"date": pd.date_range(start, end)})
    # day_name() and isocalendar() replace the removed
    # weekday_name and weekofyear attributes
    df["week_day"] = df.date.dt.day_name()
    df["day"] = df.date.dt.day
    df["month"] = df.date.dt.month
    df["week"] = df.date.dt.isocalendar().week
    df["quarter"] = df.date.dt.quarter
    df["year"] = df.date.dt.year
    # deterministic key: the date written as the integer YYYYMMDD
    df.insert(0, "date_id", df.date.dt.strftime("%Y%m%d").astype(int))
    return df
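
Because the key is just the date written as YYYYMMDD, another process can compute it straight from its own date column. A rough sketch, where the helper name `date_to_id` and the `events` frame are hypothetical and only there for illustration:

```python
import pandas as pd

def date_to_id(dates: pd.Series) -> pd.Series:
    """Return the YYYYMMDD integer key for a datetime Series (hypothetical helper)."""
    return dates.dt.strftime("%Y%m%d").astype(int)

# Made-up data standing in for a table loaded by some other ETL step.
events = pd.DataFrame({"event_date": pd.to_datetime(["1990-01-01", "2024-02-29"])})
events["date_id"] = date_to_id(events["event_date"])
print(events)
```

The resulting date_id values match the ones create_date_table3 would assign to the same dates, so no join against the date table is needed to obtain the key.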
Jon answered Oct 23 '22 09:10


Use this function:

import pandas as pd

def create_date_table(start='2000-01-01', end='2050-12-31'):
    # Record_timestamp is left empty (NaN) for now
    dates = pd.DataFrame(columns=['Record_timestamp'],
                         index=pd.date_range(start, end))
    dates.index.name = 'Date'

    dates['Day'] = dates.index.day_name()
    # index.week was removed in pandas 2.0; isocalendar() gives the ISO week
    dates['Week'] = dates.index.isocalendar().week
    dates['Month'] = dates.index.month
    dates['Quarter'] = dates.index.quarter
    dates['Year_half'] = (dates.index.quarter + 1) // 2
    dates['Year'] = dates.index.year
    # move Date into a column; the new integer index serves as date_id
    dates.reset_index(inplace=True)
    dates.index.name = 'date_id'
    return dates
redacted answered Oct 23 '22 08:10