 

Pivot by year and get sum of amounts from 2020

Tags: pivot, pyspark

I have data like this:

[image: a sample table with columns id, type, year, amount; the same data is reconstructed in the answers below]

I want to pivot by year and show a total computed only from 2020 onwards.

[image: expected output, pivoted by year with one column per year and a total column]

How do I achieve this?

asked Jan 23 '26 by sparc
2 Answers

You can achieve this with a pivot on year and an aggregate sum on amount, then use reduce with operator.add over the pivoted columns to generate the total.

Data Preparation

import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame(
    {"id": [1]*5,
     "type": ['A','A','A','B','B'],
     "year": [2019,2020,2021,2019,2021],
     "amount":[50,75,100,25,75]
    })

sparkDF = spark.createDataFrame(df)

sparkDF.show()

+---+----+----+------+
| id|type|year|amount|
+---+----+----+------+
|  1|   A|2019|    50|
|  1|   A|2020|    75|
|  1|   A|2021|   100|
|  1|   B|2019|    25|
|  1|   B|2021|    75|
+---+----+----+------+

Pivot - All

from functools import reduce
from operator import add

sparkDF_pivot = sparkDF.groupBy("id","type")\
                       .pivot("year")\
                       .agg(F.sum(F.col('amount')))\
                       .fillna(0)

sparkDF_pivot.withColumn('total',reduce(add, [F.col(x) for x in sparkDF_pivot.columns if x not in ['id','type']]))\
             .show()


+---+----+----+----+----+-----+
| id|type|2019|2020|2021|total|
+---+----+----+----+----+-----+
|  1|   B|  25|   0|  75|  100|
|  1|   A|  50|  75| 100|  225|
+---+----+----+----+----+-----+

Pivot - Specific Year Example

from functools import reduce
from operator import add

sparkDF_pivot = sparkDF.groupBy("id","type")\
                       .pivot("year")\
                       .agg(F.sum(F.col('amount')))\
                       .fillna(0)

# Pivoted column names are strings, so exclude '2019' (the int 2019 would never match)
sparkDF_pivot.withColumn('total',reduce(add, [F.col(x)
                 for x in sparkDF_pivot.columns if x not in ['id','type','2019']]))\
             .show()


+---+----+----+----+----+-----+
| id|type|2019|2020|2021|total|
+---+----+----+----+----+-----+
|  1|   B|  25|   0|  75|   75|
|  1|   A|  50|  75| 100|  175|
+---+----+----+----+----+-----+
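The reduce(add, ...) pattern is not PySpark-specific: it simply folds + over a list, which is why it works on Column expressions too. A minimal plain-Python sketch of the same column filtering, using the type A row values from the output above:

```python
from functools import reduce
from operator import add

# Type A row from the pivot above, keyed by the (string) column names
row = {'id': 1, 'type': 'A', '2019': 50, '2020': 75, '2021': 100}
excluded = ['id', 'type', '2019']

total = reduce(add, [v for k, v in row.items() if k not in excluded])
print(total)  # 175, matching the 'total' column for type A
```

Note that pivot() also accepts an explicit list of values, e.g. .pivot("year", [2020, 2021]), which avoids computing unwanted year columns in the first place.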
answered Jan 25 '26 by Vaebhav


Use slice. The pivoted columns form a list [2019, 2020, 2021, ...]; with slice(array, start, length) I can leave out 2019 by starting at position 2 and taking the remaining length of the array.

from pyspark.sql.functions import array, expr, sum

s = df.groupby('id').pivot('year').agg(sum('amount'))  # pivot
(s.withColumn('x', array(*[x for x in s.columns if x != 'id']))  # collect year columns into an array
 .withColumn('x', expr("aggregate(slice(x, 2, size(x)), cast(0 as double), (c,i) -> c + coalesce(i,0))"))  # sum, skipping 2019
).show()
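To see what that aggregate expression does, here is a plain-Python equivalent (an illustration, not Spark code), using the row values from the solution output below:

```python
# aggregate(slice(x, 2, size(x)), cast(0 as double), (c,i) -> c + coalesce(i,0))
x = [75, 75, 175]   # pivoted values for 2019, 2020, 2021 (id=1)
sliced = x[1:]      # slice(x, 2, size(x)): SQL slice is 1-based, so start=2 drops 2019
total = 0.0         # accumulator starts as cast(0 as double)
for i in sliced:
    total += i if i is not None else 0   # coalesce(i, 0) treats null years as 0
print(total)  # 250.0
```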

data

import pandas as pd

df = spark.createDataFrame(pd.DataFrame(
    {"id": [1]*5,
     "type": ['A','A','A','B','B'],
     "year": [2019,2020,2021,2019,2021],
     "amount":[50,75,100,25,75]
    }))

solution

+---+----+----+----+-----+
| id|2019|2020|2021|    x|
+---+----+----+----+-----+
|  1|  75|  75| 175|250.0|
+---+----+----+----+-----+
answered Jan 25 '26 by wwnde


