I have some daily data in a df
, which goes back as far as 1st January 2020. It looks similar to the below but with many id1
s on each day.
| yyyy_mm_dd | id1 | id2 | cost |
|------------|-----|------|-------|
| 2020-01-01 | 23 | 7253 | 5003 |
| 2020-01-01 | 23 | 7743 | 30340 |
| 2020-01-02 | 23 | 7253 | 450 |
| 2020-01-02 | 23 | 7743 | 4500 |
| ... | ... | ... | ... |
| 2021-01-01 | 23 | 7253 | 5675 |
| 2021-01-01 | 23 | 134 | 1030 |
| 2021-01-01 | 23 | 3445 | 564 |
| 2021-01-01 | 23 | 4534 | 345 |
| ... | ... | ... | ... |
I have grouped and calculated the summed cost like so:
grouped_quarterly = (
df
.withColumn('year_quarter', (F.year(F.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd'))
.groupby('id1', 'year_quarter')
.agg(
F.sum('cost').alias('cost')
)
)
I am able to then successfully make a quarter over quarter comparison like so:
w = Window.partitionBy(F.col('id1'), F.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
growth = (
grouped_quarterly
.withColumn('prev_value', F.lag(F.col('cost')).over(w))
.withColumn('diff', F.when(F.isnull(F.col('cost') - F.col('prev_value')), 0).otherwise(F.col('cost') - F.col('prev_value')))
).where(F.col('year_quarter') >= 202101)
I would like to modify this to be quarter to date instead of quarter over quarter. For example, the above would compare April 1st 2020 - June 30th 2020 with April 1st 2020 - April 15th 2021 (or whatever maximum date in df is).
Instead, I would prefer to compare April 1st 2020 - April 15th 2020 with April 1st 2021 - April 15th 2021.
Is it possible to ensure only the same periods are compared within year_quarter?
Edit: Adding sample output:
grouped_quarterly.where(F.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost |
|-----|--------------|-------|
| 222 | 202001 | 49428 |
| 222 | 202002 | 43292 |
| 222 | 202003 | 73928 |
| 222 | 202004 | 12028 |
| 222 | 202101 | 19382 |
| 222 | 202102 | 4282 |
growth.where(F.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost | prev_value | diff | growth |
|-----|--------------|-------|------------|--------|--------|
| 222 | 202101 | 52494 | 49428 | 3066 | 6.20 |
| 222 | 202102 | 4282 | 43292 | -39010 | -90.10 |
The growth calculation from the window is being done correctly. However, since 202102 is in progress, it gets compared to the full 202002. The comparison for 202101 works perfectly as both year_quarters are complete.
Is there anyway to ensure the window function only compares the same period within the year_quarter with the previous year, for incomplete quarters? I hope the sample data makes my question a bit more clear
How Does Quarter to Date (QTD) Work? By adding the revenue for the three months of the first quarter, we can calculate that Company XYZ's quarter-to-date revenue is $4,500,000.
Choose which time period (quarter) you want to calculate QoQ growth. Subtract last quarter's number from current quarter's number. If the number is positive, there has been quarter over quarter growth.
Comparing quarters on a year-over-year (YOY) basis can be more effective than on a quarter on quarter (QOQ) basis, as it gives a broader picture of company health and is not impacted by seasonal issues.
Key Takeaways Quarter over quarter (Q/Q) measures the growth of an investment or a company from one quarter to the next. Q/Q is also used to measure changes in other important statistics, such as gross domestic product (GDP). Analysts consider Q/Q when reviewing a company's performance over multiple quarterly periods.
The idea is to split the task into two parts:
First generate some additional test data for 2019Q2, 2020Q2 and 2021Q2:
data = [('2019-04-01', 23, 1), ('2019-04-01', 23, 2), ('2019-04-02', 23, 3), ('2019-04-15', 23, 4),
('2019-04-16', 23, 5), ('2019-04-17', 23, 6), ('2019-05-01', 23, 7), ('2019-06-30', 23, 8),
('2019-07-01', 23, 9), ('2020-01-01',23,5003),('2020-01-01',23,30340), ('2020-01-02',23,450),
('2020-01-02',23,4500), ('2020-04-01', 23, 10), ('2020-04-01', 23, 20), ('2020-04-02', 23, 30),
('2020-04-15', 23, 40), ('2020-04-16', 23, 50), ('2020-04-17', 23, 60), ('2020-05-01', 23, 70),
('2020-06-30', 23, 80), ('2020-07-01', 23, 90), ('2021-01-01',23,5675), ('2021-01-01',23,1030),
('2021-01-01',23,564), ('2021-01-01',23,345), ('2021-04-01', 23, -10), ('2021-04-01', 23, -20),
('2021-04-02', 23, -30), ('2021-04-15', 23, -40)]
Calcuate the year_quarter
column and cache the result:
df = spark.createDataFrame(data=data, schema = ["yyyy_mm_dd", "id1", "cost"]) \
.withColumn("yyyy_mm_dd", F.to_date("yyyy_mm_dd", "yyyy-MM-dd")) \
.withColumn('year_quarter', (F.year(F.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd')))) \
.cache()
Get the maximum date and its corresponding quarter:
max_row = df.selectExpr("max(yyyy_mm_dd)", "max_by(year_quarter, yyyy_mm_dd)").head()
cur_date, cur_quarter = max_row[0], max_row[1]
It is not strictly necessary to set cur_date
to the maximum date of the data. Instead cur_date
and cur_quarter
could also be set manually.
For all quarters but the current one apply the logic given in the question:
w = Window.partitionBy(F.col('id1'), F.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
df_full_quarters = df.filter(f"year_quarter <> {cur_quarter}") \
.groupby('id1', 'year_quarter') \
.agg(F.sum('cost').alias('cost')) \
.withColumn('prev_value', F.lag(F.col('cost')).over(w))
For the current quarter filter out all dates in the previous year that should be ignored:
df_cur_quarter = df.filter(f"year_quarter = {cur_quarter} or (year_quarter = {cur_quarter - 100} and add_months(yyyy_mm_dd, 12) <= '{cur_date}')") \
.groupby('id1', 'year_quarter') \
.agg(F.sum('cost').alias('cost')) \
.withColumn('prev_value', F.lag(F.col('cost')).over(w)) \
.filter(f"year_quarter = {cur_quarter}")
Finally union the two parts and calculate the diff
column:
growth = df_full_quarters.union(df_cur_quarter) \
.withColumn('diff', F.when(F.isnull(F.col('cost') - F.col('prev_value')), 0).otherwise(F.col('cost') - F.col('prev_value'))) \
.orderBy("id1", "year_quarter")
The result will be:
+---+------------+-----+----------+------+
|id1|year_quarter| cost|prev_value| diff|
+---+------------+-----+----------+------+
| 23| 201902| 36| null| 0|
| 23| 201903| 9| null| 0|
| 23| 202001|40293| null| 0|
| 23| 202002| 360| 36| 324|
| 23| 202003| 90| 9| 81|
| 23| 202101| 7614| 40293|-32679|
| 23| 202102| -100| 100| -200|
+---+------------+-----+----------+------+
In this example, for the comparison of 2021Q2 with the previous year the sum for 2020Q2 is given as 100, but the actual value for the full 2020Q2 is 360.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With