Spark pivot invokes Job even though pivot is not an Action

Question

This may be a silly question, but I noticed that:

val aggDF = df.groupBy("id").pivot("col1")

causes a Job to be invoked. I am running this in a Databricks notebook, which reports:

(1) Spark Jobs
    Job 4 View     (Stages: 3/3)
       Stage 12:     8/8
       Stage 13:     200/200
       Stage 14:     1/1

As far as I can tell from the docs, pivot is not an Action.

As usual, I cannot find a suitable reference in the docs to explain this. Presumably either pivot is treated as an Action, or it calls some part of Spark that is an Action.

baitmbarek · Accepted Answer

There are two versions of pivot in RelationalGroupedDataset.

If you pass only the pivot column, Spark has to fetch all of its distinct values to work out which output columns to generate. It does this with a collect, which is an Action and therefore runs a Job.
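As a rough illustration, the single-argument overload behaves as if it first ran a query like the one below to discover the pivot values. This is a sketch under assumptions, not the Spark source: the object name, the helper `distinctPivotValues`, the sample data and the `amount` column are all made up for the example. The real implementation also caps the number of distinct values it will accept (see the `spark.sql.pivotMaxValues` setting).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PivotDistinctSketch {
  // Hypothetical helper: approximates the lookup Spark needs when no values are given.
  // collect() is an Action, so calling this runs a Job.
  def distinctPivotValues(df: DataFrame, column: String): Seq[Any] =
    df.select(column).distinct().sort(column).collect().map(_.get(0)).toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-distinct-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data; "id" and "col1" follow the names used in the question.
    val df = Seq((1, "a", 10), (1, "b", 20), (2, "a", 30)).toDF("id", "col1", "amount")

    // This line shows up as a Job in the Spark UI, before any aggregation is requested.
    val values = distinctPivotValues(df, "col1")
    println(values)

    spark.stop()
  }
}
```

The distinct step needs a shuffle, which would be consistent with the multi-stage Job shown in the question.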

The other overload is the recommended one, but it requires you to know the possible pivot values in advance.

You can take a look at the source code: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

def pivot(pivotColumn: Column): RelationalGroupedDataset

vs

def pivot(pivotColumn: Column, values: Seq[Any]): RelationalGroupedDataset
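For comparison, here is a minimal sketch of the second overload in use. It assumes the same `id` and `col1` columns as the question, plus a made-up `amount` column, sample rows and helper name `pivotKnown`. The example passes the column name as a `String`, as the question does; to my knowledge the `String` variants behave like the `Column` ones shown above. Because the values are supplied up front, building the pivoted DataFrame should not start a Job; the first Job runs only when an Action such as `show()` is called.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object PivotWithValues {
  // Hypothetical helper: pivots "col1" using a value list known in advance,
  // so Spark does not need to collect the distinct values first.
  def pivotKnown(df: DataFrame, values: Seq[Any]): DataFrame =
    df.groupBy("id").pivot("col1", values).agg(sum("amount"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-with-values")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df = Seq((1, "a", 10), (1, "b", 20), (2, "a", 30)).toDF("id", "col1", "amount")

    // Still lazy: this only builds a query plan.
    val wide = pivotKnown(df, Seq("a", "b"))

    // The first Action; this is where a Job is expected to appear.
    wide.show()

    spark.stop()
  }
}
```

If the value list is not known ahead of time, the single-argument overload is still fine; it just pays for one extra Job up front.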

Tags:

apache-spark

apache-spark-sql

thebluephantom
