This may be a silly question, but I noticed that:
val aggDF = df.groupBy("id").pivot("col1")
causes a Job to be triggered. I am running this in a Databricks notebook, and this is the output:
(1) Spark Jobs
Job 4 View (Stages: 3/3)
Stage 12: 8/8
Stage 13: 200/200
Stage 14: 1/1
From the docs, I was not aware that pivot is an Action.
As usual, I cannot find a suitable reference in the docs to explain this. Presumably pivot is either treated as an Action or internally calls some part of Spark that is an Action.
There are two versions of pivot in RelationalGroupedDataset.
If you pass only the pivot column, Spark has to fetch all of its distinct values to work out which output columns to generate, and it does so by performing a collect. That collect is the Job you are seeing.
The other version is the recommended one, but it requires you to know in advance the possible values from which the columns are generated.
You can take a look at the source code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
def pivot(pivotColumn: Column): RelationalGroupedDataset
vs
def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset
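To illustrate the difference between the two versions, here is a minimal sketch written notebook-style. The sample data, the values "a" and "b", and the names eager, deferred and wide are made up for illustration. In a Databricks notebook the spark session already exists, so the builder line is only needed when running elsewhere.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

// Only needed outside a notebook; getOrCreate() reuses an existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pivot-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the df in the question.
val df = Seq((1, "a"), (1, "b"), (2, "a")).toDF("id", "col1")

// Values omitted: Spark must collect the distinct values of col1 here,
// so this line alone should show up as a Job in the Spark UI.
val eager = df.groupBy("id").pivot("col1")

// Values supplied up front (assumed known): nothing needs to be collected,
// so no Job is expected until an action runs.
val deferred = df.groupBy("id").pivot("col1", Seq("a", "b"))

// agg is a transformation, so this is still lazy.
val wide = deferred.agg(count("col1"))

// show() is an action, so the work happens here.
wide.show()
```

With the second form, the Job should only appear when show() (or another action) is called, rather than on the pivot line itself.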