A table of dates with primary keys is sometimes used in database design.
| date_id | Date | Record_timestamp | Day | Week | Month | Quarter | Year_half | Year |
|---------+----------------+---------------------+-----------+-------+--------+-------------+-------------+----------|
| 0 | 2000-01-01 | NaN | Saturday | 52 | 1 | 1 | 1 | 2000 |
| 1 | 2000-01-02 | NaN | Sunday | 52 | 1 | 1 | 1 | 2000 |
| 2 | 2000-01-03 | NaN | Monday | 1 | 1 | 1 | 1 | 2000 |
How can I build such a table in pandas?
Create a table that will hold a sequence of dates called date_sequence. We will programmatically add dates for May 2022 to this table. Create a date dimension table with all of our dimension columns called date_dim, and then use the dates in date_sequence to calculate the values for each of those dimensions.
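In pandas, the same two-step idea might look like the sketch below. The names date_sequence and date_dim come from the description above; the particular dimension columns are an assumption chosen for illustration, not a fixed schema.

```python
import pandas as pd

# Step 1: the sequence of dates (May 2022, as in the description above).
date_sequence = pd.DataFrame({"date": pd.date_range("2022-05-01", "2022-05-31")})

# Step 2: derive the dimension columns from that sequence.
# The column set here is illustrative only.
date_dim = date_sequence.assign(
    day_name=date_sequence["date"].dt.day_name(),
    day_of_month=date_sequence["date"].dt.day,
    iso_week=date_sequence["date"].dt.isocalendar().week.astype(int),
    month=date_sequence["date"].dt.month,
    quarter=date_sequence["date"].dt.quarter,
    year=date_sequence["date"].dt.year,
)
```

Keeping the raw sequence separate from the derived columns means the range can be extended later and the dimensions recomputed from it.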
Generate with DAX. Use the CALENDAR function when you want to define a date range. You pass in two values: the start date and the end date. These values can be defined by other DAX functions, like MIN(Sales[OrderDate]) or MAX(Sales[OrderDate]).
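A rough pandas counterpart to bounding the calendar by the data itself could look like this. The sales frame and its values are hypothetical stand-ins; only the OrderDate column name is taken from the text above.

```python
import pandas as pd

# Hypothetical fact table; only the OrderDate column matters here.
sales = pd.DataFrame(
    {"OrderDate": pd.to_datetime(["2022-05-03", "2022-05-10", "2022-05-07"])}
)

# Every calendar day from the earliest to the latest order date, inclusive.
calendar = pd.DataFrame(
    {"Date": pd.date_range(sales["OrderDate"].min(), sales["OrderDate"].max())}
)
```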
How to Begin. The process starts by generating an array of dates, then exploding this array into a data frame, and creating a temporary view called dates. Now that we have a temporary view containing dates, we can use Spark SQL to select the desired columns for the calendar dimension.
This is a little cleaner with the dt accessor:
In [11]: def create_date_table2(start='2000-01-01', end='2050-12-31'):
    ...:     df = pd.DataFrame({"Date": pd.date_range(start, end)})
    ...:     df["Day"] = df.Date.dt.day_name()  # replaces weekday_name, gone from recent pandas
    ...:     df["Week"] = df.Date.dt.isocalendar().week  # replaces weekofyear, gone from recent pandas
    ...:     df["Quarter"] = df.Date.dt.quarter
    ...:     df["Year"] = df.Date.dt.year
    ...:     df["Year_half"] = (df.Quarter + 1) // 2
    ...:     return df
In [12]: create_date_table2().head()
Out[12]:
        Date        Day  Week  Quarter  Year  Year_half
0 2000-01-01   Saturday    52        1  2000          1
1 2000-01-02     Sunday    52        1  2000          1
2 2000-01-03     Monday     1        1  2000          1
3 2000-01-04    Tuesday     1        1  2000          1
4 2000-01-05  Wednesday     1        1  2000          1
In [13]: create_date_table2().tail()
Out[13]:
            Date        Day  Week  Quarter  Year  Year_half
18623 2050-12-27    Tuesday    52        4  2050          2
18624 2050-12-28  Wednesday    52        4  2050          2
18625 2050-12-29   Thursday    52        4  2050          2
18626 2050-12-30     Friday    52        4  2050          2
18627 2050-12-31   Saturday    52        4  2050          2
Note: you may like to calculate these on the fly rather than store them as columns!
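As a minimal sketch of the on-the-fly alternative, the derived values can be computed from a datetime column at the moment they are needed. The events frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical event data; in practice this would be your own table.
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2000-01-01", "2000-01-03", "2000-07-15"]),
        "amount": [10, 20, 30],
    }
)

# Derive the half-year only when it is needed, without a stored date table.
quarter = events["timestamp"].dt.quarter
year_half = (quarter + 1) // 2
totals = events.groupby(year_half.rename("Year_half"))["amount"].sum()
```

This avoids storing and joining a separate table, at the cost of repeating the derivation wherever it is used.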
I liked Andy and Robin's approaches and modified their create_date_table functions slightly for my needs, in case you are interested in having a deterministic date_id. I find this helpful because other ETL processes can then derive the key directly from a date, with no extra look-up step.
import pandas as pd

def create_date_table3(start='1990-01-01', end='2080-12-31'):
    df = pd.DataFrame({"date": pd.date_range(start, end)})
    df["week_day"] = df.date.dt.day_name()  # replaces weekday_name, gone from recent pandas
    df["day"] = df.date.dt.day
    df["month"] = df.date.dt.month
    df["week"] = df.date.dt.isocalendar().week  # replaces weekofyear, gone from recent pandas
    df["quarter"] = df.date.dt.quarter
    df["year"] = df.date.dt.year
    # Deterministic key in YYYYMMDD form, e.g. 1990-01-01 -> 19900101
    df.insert(0, 'date_id', (df.year.astype(str) + df.month.astype(str).str.zfill(2) + df.day.astype(str).str.zfill(2)).astype(int))
    return df
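Because the date_id above is just the date written as a YYYYMMDD integer, other processes can compute it directly from a date. The helpers below are a hypothetical sketch of that idea (date_to_id and id_to_date are names made up here, not part of the answer).

```python
import pandas as pd


def date_to_id(value):
    """Return the YYYYMMDD integer key for a date-like value (hypothetical helper)."""
    return int(pd.Timestamp(value).strftime("%Y%m%d"))


def id_to_date(date_id):
    """Invert date_to_id: parse a YYYYMMDD integer back into a Timestamp."""
    return pd.to_datetime(str(date_id), format="%Y%m%d")


key = date_to_id("2022-05-17")
round_trip = id_to_date(key)
```

For a whole column, something like df["date"].dt.strftime("%Y%m%d").astype(int) should produce the same keys in one step.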
Use this function
import pandas as pd

def create_date_table(start='2000-01-01', end='2050-12-31'):
    start_ts = pd.to_datetime(start).date()
    end_ts = pd.to_datetime(end).date()

    # Record timestamp is empty for now
    dates = pd.DataFrame(columns=['Record_timestamp'],
                         index=pd.date_range(start_ts, end_ts))
    dates.index.name = 'Date'

    # Map dayofweek (0 = Monday ... 6 = Sunday) to day names
    days_names = {
        i: name
        for i, name
        in enumerate(['Monday', 'Tuesday', 'Wednesday',
                      'Thursday', 'Friday', 'Saturday',
                      'Sunday'])
    }

    dates['Day'] = dates.index.dayofweek.map(days_names.get)
    # index.week is gone from recent pandas; isocalendar() gives the ISO week
    dates['Week'] = dates.index.isocalendar().week
    dates['Month'] = dates.index.month
    dates['Quarter'] = dates.index.quarter
    dates['Year_half'] = dates.index.month.map(lambda mth: 1 if mth < 7 else 2)
    dates['Year'] = dates.index.year
    # Move Date into a column; the new integer index becomes date_id
    dates.reset_index(inplace=True)
    dates.index.name = 'date_id'
    return dates