What is Project node in execution query plan?

1 Answer

NOTE: Since the plan uses a RepartitionByExpression node, it must be a logical query plan.

A Project node in a logical query plan stands for the Project unary logical operator. It is created whenever you use some kind of projection, explicitly or implicitly.

Quoting Wikipedia's Projection (relational algebra):

In practical terms, it can be roughly thought of as picking a subset of all available columns.

A Project node can appear in a logical query plan explicitly for the following:

Dataset operators, e.g. joinWith, select, unionByName
KeyValueGroupedDataset operators, e.g. keys, mapValues
SQL's SELECT queries

A Project node can also appear in the analysis and optimization phases.

In Spark SQL, the Dataset API gives you the high-level operators, e.g. select, filter or groupBy, that ultimately build a Catalyst logical plan of a structured query. In other words, the simple-looking Dataset.select operator merely creates a LogicalPlan with a Project node.

scala> val query = spark.range(4).select("id")
scala> println(query.queryExecution.logical)
'Project [unresolvedalias('id, None)]
+- Range (0, 4, step=1, splits=Some(8))

(You could use query.explain(extended = true) instead, but that prints all four plans, which would obscure the point.)

Have a look at the source code of the Dataset.select operator:
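If four plans at once are too noisy, each of them can be pulled out of queryExecution on its own. What follows is a minimal sketch meant for pasting into spark-shell (where the builder call simply returns the running session); the app name and the qe and rootIsProject values are illustrative names, not part of the original example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.Project

// In spark-shell this returns the already-running session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("project-node-demo") // illustrative name
  .getOrCreate()

val query = spark.range(4).select("id")
val qe = query.queryExecution

println(qe.logical)       // parsed (unresolved) logical plan
println(qe.analyzed)      // analyzed logical plan
println(qe.optimizedPlan) // optimized logical plan
println(qe.executedPlan)  // physical plan

// The root of the parsed plan is the Project node that select created.
val rootIsProject = qe.logical.isInstanceOf[Project]
println(rootIsProject)
```

Printing the plans one by one makes it easier to see in which phase a Project node is present.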

def select(cols: Column*): DataFrame = withPlan {
  Project(cols.map(_.named), logicalPlan)
}

This simple-looking select operator is a thin wrapper that builds a Catalyst tree of logical operators, i.e. a logical plan.

NOTE What's nice about Spark SQL's Catalyst is that it uses the recursive LogicalPlan abstraction, which represents a single logical operator or a tree of logical operators.

NOTE The same applies to good ol' SQL: after being parsed, the SQL text is transformed into an AST of logical operators. See the example below.

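A Project node can come and go between phases. Projection only determines which columns appear in the output, so a Project that does not change the output can be dropped by the optimizer and may or may not appear in the plans of your queries.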
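To see a Project node come and go, you can check for it in the individual plans of two queries. This is a sketch for spark-shell; hasProject is a hypothetical helper introduced here for illustration, and the expected results in the comments assume the optimizer behaves as in the SQL example in this answer, where the optimized plan is a bare Range.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.functions.col

// In spark-shell this returns the already-running session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("project-come-and-go") // illustrative name
  .getOrCreate()

// Hypothetical helper: true if any node in the plan tree is a Project.
def hasProject(plan: LogicalPlan): Boolean =
  plan.find(_.isInstanceOf[Project]).isDefined

// Selects exactly the columns the child already produces.
val sameColumns = spark.range(4).select("id")
// Computes a new column, so the projection does real work.
val newColumn = spark.range(4).select((col("id") + 1).as("next"))

val inAnalyzed = hasProject(sameColumns.queryExecution.analyzed)
val noopKept = hasProject(sameColumns.queryExecution.optimizedPlan)
val computedKept = hasProject(newColumn.queryExecution.optimizedPlan)

println(inAnalyzed)   // expected: true, Project is still in the analyzed plan
println(noopKept)     // expected: false, the no-op Project is optimized away
println(computedKept) // expected: true, this Project computes a new column
```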

Catalyst DSL

You can use Spark SQL's Catalyst DSL (in org.apache.spark.sql.catalyst.dsl package object) for constructing Catalyst data structures using Scala implicit conversions. That could be very useful if you are into Spark testing.

scala> spark.version
res0: String = 2.3.0-SNAPSHOT

scala> import org.apache.spark.sql.catalyst.dsl.plans._  // <-- gives table and select
scala> import org.apache.spark.sql.catalyst.dsl.expressions.star
scala> val plan = table("a").select(star())
scala> println(plan.numberedTreeString)
00 'Project [*]
01 +- 'UnresolvedRelation `a`
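For tests it is often handier to project over a relation whose columns are already known, so no catalog lookup is involved. The following is a sketch under that assumption: the id and name attributes and the LocalRelation are made up for illustration, and only the Catalyst classes need to be on the classpath (no SparkSession is used).

```scala
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, Project}
import org.apache.spark.sql.types.{LongType, StringType}

// Illustrative attributes; any names and types would do.
val id = AttributeReference("id", LongType)()
val name = AttributeReference("name", StringType)()

// A relation with a fixed, already-resolved output of two columns.
val relation = LocalRelation(id, name)

// The DSL's select wraps the relation in a Project node.
val plan = relation.select(id)
println(plan.numberedTreeString)

// Pattern matching exposes the project list and the child operator.
val projected = plan match {
  case Project(projectList, _) => projectList.map(_.name)
  case _                       => Seq.empty[String]
}
println(projected)
```

The projection picks one of the two available columns, which is the "subset of all available columns" from the Wikipedia quote above.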

Good ol' SQL

scala> spark.range(4).createOrReplaceTempView("nums")

scala> spark.sql("SHOW TABLES").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|        |     nums|       true|
+--------+---------+-----------+


scala> spark.sql("SELECT * FROM nums").explain
== Physical Plan ==
*Range (0, 4, step=1, splits=8)

scala> spark.sql("SELECT * FROM nums").explain(true)
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `nums`

== Analyzed Logical Plan ==
id: bigint
Project [id#40L]
+- SubqueryAlias nums
   +- Range (0, 4, step=1, splits=Some(8))

== Optimized Logical Plan ==
Range (0, 4, step=1, splits=Some(8))

== Physical Plan ==
*Range (0, 4, step=1, splits=8)
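If you only care about what the parser produces, you can parse the SQL text without touching any table. This sketch uses Catalyst's CatalystSqlParser directly; spark.sql goes through the session's own parser, which I assume yields the same tree for a plain SELECT like this one. The value names are illustrative.

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.catalyst.plans.logical.Project

// Parsing only: the nums table is never looked up, so it need not exist.
val parsed = CatalystSqlParser.parsePlan("SELECT * FROM nums")
println(parsed.numberedTreeString)

// The root of the parsed tree is the Project for the SELECT list.
val rootIsProject = parsed.isInstanceOf[Project]
// The star and the relation are still unresolved at this point.
val isResolved = parsed.resolved

println(rootIsProject)
println(isResolved)
```

Resolving the star and the relation is the analyzer's job, which is why the analyzed plan above shows Project [id#40L] instead of 'Project [*].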

answered Oct 10 '22 17:10

Jacek Laskowski

Tags:

apache-spark

apache-spark-sql

Evan M.
