Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Quarter to date growth

I have some daily data in a df, which goes back as far as 1st January 2020. It looks similar to the below but with many id1s on each day.

| yyyy_mm_dd | id1 | id2  | cost  |
|------------|-----|------|-------|
| 2020-01-01 | 23  | 7253 | 5003  |
| 2020-01-01 | 23  | 7743 | 30340 |
| 2020-01-02 | 23  | 7253 | 450   |
| 2020-01-02 | 23  | 7743 | 4500  |
| ...        | ... | ...  | ...   |
| 2021-01-01 | 23  | 7253 | 5675  |
| 2021-01-01 | 23  | 134  | 1030  |
| 2021-01-01 | 23  | 3445 | 564   |
| 2021-01-01 | 23  | 4534 | 345   |
| ...        | ... | ...  | ...   |

I have grouped and calculated the summed cost like so:

grouped_quarterly = (
    df
    .withColumn('year_quarter', (F.year(F.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd'))
    .groupby('id1', 'year_quarter')
    .agg(
        F.sum('cost').alias('cost')
    )
)

I am able to then successfully make a quarter over quarter comparison like so:

w = Window.partitionBy(F.col('id1'), F.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
growth = (
    grouped_quarterly
    .withColumn('prev_value', F.lag(F.col('cost')).over(w))
    .withColumn('diff', F.when(F.isnull(F.col('cost') - F.col('prev_value')), 0).otherwise(F.col('cost') - F.col('prev_value')))
).where(F.col('year_quarter') >= 202101)

I would like to modify this to be quarter to date instead of quarter over quarter. For example, the above would compare April 1st 2020 - June 30th 2020 with April 1st 2020 - April 15th 2021 (or whatever maximum date in df is).

Instead, I would prefer to compare April 1st 2020 - April 15th 2020 with April 1st 2021 - April 15th 2021.

Is it possible to ensure only the same periods are compared within year_quarter?

Edit: Adding sample output:


grouped_quarterly.where(F.col('id1') == 222).sort('year_quarter').show(10,False)

| id1 | year_quarter | cost  |
|-----|--------------|-------|
| 222 | 202001       | 49428 |
| 222 | 202002       | 43292 |
| 222 | 202003       | 73928 |
| 222 | 202004       | 12028 |
| 222 | 202101       | 19382 |
| 222 | 202102       | 4282  |

growth.where(F.col('id1') == 222).sort('year_quarter').show(10,False)

| id1 | year_quarter | cost  | prev_value | diff   | growth |
|-----|--------------|-------|------------|--------|--------|
| 222 | 202101       | 52494 | 49428      | 3066   | 6.20   |
| 222 | 202102       | 4282  | 43292      | -39010 | -90.10 |

The growth calculation from the window is being done correctly. However, since 202102 is in progress, it gets compared to the full 202002. The comparison for 202101 works perfectly as both year_quarters are complete.

Is there anyway to ensure the window function only compares the same period within the year_quarter with the previous year, for incomplete quarters? I hope the sample data makes my question a bit more clear

like image 653
stackq Avatar asked Apr 15 '21 10:04

stackq


People also ask

How do I calculate a quarter to date?

How Does Quarter to Date (QTD) Work? By adding the revenue for the three months of the first quarter, we can calculate that Company XYZ's quarter-to-date revenue is $4,500,000.

How do you calculate quarter growth?

Choose which time period (quarter) you want to calculate QoQ growth. Subtract last quarter's number from current quarter's number. If the number is positive, there has been quarter over quarter growth.

What is QOQ and yoy?

Comparing quarters on a year-over-year (YOY) basis can be more effective than on a quarter on quarter (QOQ) basis, as it gives a broader picture of company health and is not impacted by seasonal issues.

What is quarter on quarter growth?

Key Takeaways Quarter over quarter (Q/Q) measures the growth of an investment or a company from one quarter to the next. Q/Q is also used to measure changes in other important statistics, such as gross domestic product (GDP). Analysts consider Q/Q when reviewing a company's performance over multiple quarterly periods.


1 Answers

The idea is to split the task into two parts:

  1. Calculate the growth for the complete quarters. This logic is completely taken over from the question and then
  2. calculate the growth for the currently running quarter.

First generate some additional test data for 2019Q2, 2020Q2 and 2021Q2:

data = [('2019-04-01', 23, 1), ('2019-04-01', 23, 2), ('2019-04-02', 23, 3), ('2019-04-15', 23, 4),
        ('2019-04-16', 23, 5), ('2019-04-17', 23, 6), ('2019-05-01', 23, 7), ('2019-06-30', 23, 8),
        ('2019-07-01', 23, 9), ('2020-01-01',23,5003),('2020-01-01',23,30340), ('2020-01-02',23,450),
        ('2020-01-02',23,4500), ('2020-04-01', 23, 10), ('2020-04-01', 23, 20), ('2020-04-02', 23, 30),
        ('2020-04-15', 23, 40), ('2020-04-16', 23, 50), ('2020-04-17', 23, 60), ('2020-05-01', 23, 70),
        ('2020-06-30', 23, 80), ('2020-07-01', 23, 90), ('2021-01-01',23,5675), ('2021-01-01',23,1030),
        ('2021-01-01',23,564), ('2021-01-01',23,345), ('2021-04-01', 23, -10), ('2021-04-01', 23, -20),
        ('2021-04-02', 23, -30), ('2021-04-15', 23, -40)]

Calcuate the year_quarter column and cache the result:

df = spark.createDataFrame(data=data, schema = ["yyyy_mm_dd", "id1", "cost"]) \
    .withColumn("yyyy_mm_dd", F.to_date("yyyy_mm_dd", "yyyy-MM-dd")) \
    .withColumn('year_quarter', (F.year(F.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd')))) \
    .cache()

Get the maximum date and its corresponding quarter:

max_row = df.selectExpr("max(yyyy_mm_dd)", "max_by(year_quarter, yyyy_mm_dd)").head()
cur_date, cur_quarter = max_row[0], max_row[1]

It is not strictly necessary to set cur_date to the maximum date of the data. Instead cur_date and cur_quarter could also be set manually.

For all quarters but the current one apply the logic given in the question:

w = Window.partitionBy(F.col('id1'), F.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
df_full_quarters = df.filter(f"year_quarter <> {cur_quarter}") \
    .groupby('id1', 'year_quarter') \
    .agg(F.sum('cost').alias('cost')) \
    .withColumn('prev_value', F.lag(F.col('cost')).over(w))

For the current quarter filter out all dates in the previous year that should be ignored:

df_cur_quarter = df.filter(f"year_quarter = {cur_quarter} or (year_quarter = {cur_quarter - 100} and add_months(yyyy_mm_dd, 12) <= '{cur_date}')") \
    .groupby('id1', 'year_quarter') \
    .agg(F.sum('cost').alias('cost')) \
    .withColumn('prev_value', F.lag(F.col('cost')).over(w)) \
    .filter(f"year_quarter = {cur_quarter}")

Finally union the two parts and calculate the diff column:

growth = df_full_quarters.union(df_cur_quarter) \
    .withColumn('diff', F.when(F.isnull(F.col('cost') - F.col('prev_value')), 0).otherwise(F.col('cost') - F.col('prev_value'))) \
    .orderBy("id1", "year_quarter")

The result will be:

+---+------------+-----+----------+------+                                      
|id1|year_quarter| cost|prev_value|  diff|
+---+------------+-----+----------+------+
| 23|      201902|   36|      null|     0|
| 23|      201903|    9|      null|     0|
| 23|      202001|40293|      null|     0|
| 23|      202002|  360|        36|   324|
| 23|      202003|   90|         9|    81|
| 23|      202101| 7614|     40293|-32679|
| 23|      202102| -100|       100|  -200|
+---+------------+-----+----------+------+

In this example, for the comparison of 2021Q2 with the previous year the sum for 2020Q2 is given as 100, but the actual value for the full 2020Q2 is 360.

like image 184
werner Avatar answered Oct 09 '22 20:10

werner