
Spark pivot invokes Job even though pivot is not an Action

This may be a silly question, but I noticed that:

val aggDF = df.groupBy("id").pivot("col1")

causes a Job to be invoked. I am running this in a Databricks notebook, and this is what I get:

(1) Spark Jobs
    Job 4 View     (Stages: 3/3)
       Stage 12:     8/8
       Stage 13:     200/200
       Stage 14:     1/1

As far as I can tell from the docs, pivot is not an Action.

As usual, I cannot find a suitable reference in the docs to explain this, but presumably either pivot is treated as an Action or it calls some part of Spark that is an Action.

thebluephantom asked Sep 11 '25 22:09


1 Answer

RelationalGroupedDataset has two variants of pivot.

If you pass only the pivot column, Spark has to fetch all the distinct values of that column to work out the output columns, and it does so eagerly by performing a collect. That collect is the Job you see.
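To make the eager step concrete, here is a rough sketch of the kind of query the single-argument pivot runs to discover its output columns. It is an approximation for illustration, not the actual Spark source (the real code also caps the number of distinct values via spark.sql.pivotMaxValues). The sample data, the `value` column and the local master setting are assumptions; in a notebook the existing session is reused.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: local session for experimenting; in a notebook getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("pivot-distinct-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the question's df.
val df = Seq((1, "a", 10), (1, "b", 20), (2, "a", 30)).toDF("id", "col1", "value")

// Roughly what pivot("col1") has to do before it can return:
// collect() is an action, so a job runs at this point.
val pivotValues: Seq[Any] = df
  .select(col("col1"))
  .distinct()
  .sort(col("col1"))
  .collect()
  .map(_.get(0))
  .toSeq
```

Here `pivotValues` holds the two distinct values of `col1`, which become the pivoted column names.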

The other variant is the recommended one, but it requires you to know the possible pivot values in advance. Because the output columns are given up front, no job has to run when pivot is called.

You can take a look at the source code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

def pivot(pivotColumn: Column): RelationalGroupedDataset

vs

def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset
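A minimal sketch of the two overloads side by side, assuming a small hypothetical DataFrame with columns `id`, `col1` and `value`, and a `sum` aggregation chosen only for illustration. The local master setting is likewise an assumption for running outside a notebook.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Assumption: local session; in a notebook getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("pivot-overloads-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq((1, "a", 10), (1, "b", 20), (2, "a", 30)).toDF("id", "col1", "value")

// Values not supplied: pivot itself collects the distinct values of col1,
// so a job runs on this line even though no action has been called on the result.
val inferred = df.groupBy("id").pivot(col("col1")).agg(sum("value"))

// Values supplied: the output columns are known, so nothing runs until an action.
val explicit = df.groupBy("id").pivot(col("col1"), Seq("a", "b")).agg(sum("value"))

// First action on the explicit variant; this is where its job is triggered.
explicit.show()
```

Both results have the same schema; the difference is only in when the work happens. In the Spark UI, the first variant should show a job at the pivot call, while the second should show none until `show()`.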
baitmbarek answered Sep 13 '25 13:09


