
Whole-Stage Code Generation in Spark 2.0

I heard about Whole-Stage Code Generation for SQL query optimization, through p539-neumann.pdf and the question sparksql-sql-codegen-is-not-giving-any-improvemnt.

Unfortunately, no one answered that question.

I'm curious to know in which scenarios this Spark 2.0 feature should be used, but I couldn't find a proper use case after googling.

Can we use this feature whenever we use SQL? If so, is there a concrete use case that shows it working?

Ram Ghadiyaram asked Nov 11 '16

People also ask

What is whole-stage code generation in Spark?

Whole-Stage Java Code Generation (aka Whole-Stage CodeGen) is a physical query optimization in Spark SQL that fuses multiple physical operators (as a subtree of plans that support code generation) together into a single Java function.
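To make the fusion idea concrete, here is a minimal Python sketch (not Spark's actual generated Java code) contrasting the classic Volcano iterator model, where each operator processes one row per virtual call, with the single fused loop that whole-stage codegen emits for a scan → filter → project subtree:

```python
# Conceptual sketch (not Spark's generated code): whole-stage codegen
# replaces a chain of per-row operator iterators (the Volcano model)
# with one fused loop compiled into a single function.

def volcano_pipeline(rows):
    # Each operator is a separate generator: one call boundary per row.
    def scan(rows):
        for r in rows:
            yield r

    def filter_op(child):
        for r in child:
            if r % 2 == 0:
                yield r

    def project(child):
        for r in child:
            yield r * 10

    return list(project(filter_op(scan(rows))))

def fused_pipeline(rows):
    # Fused version: scan + filter + project collapse into one loop,
    # with no iterator hand-offs; values stay in local variables.
    out = []
    for r in rows:
        if r % 2 == 0:          # filter
            out.append(r * 10)  # project
    return out

assert volcano_pipeline(range(10)) == fused_pipeline(range(10)) == [0, 20, 40, 60, 80]
```

Both produce the same result; the fused form avoids the per-row call overhead, which is the performance win the feature is after.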

What is tungsten Spark?

Tungsten is the codename for the umbrella project to make changes to Apache Spark's execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware.

What is Spark SQL?

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.

What is Spark Exchange?

The Exchange is the shuffle caused by the groupBy transformation. Spark performs a hash aggregation for each partition before shuffling the data in the Exchange. After the exchange, there is a hash aggregation of the previous sub-aggregations.
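The two-phase aggregation around the Exchange can be sketched in plain Python (a hypothetical illustration, not Spark's implementation): a per-partition hash aggregation runs before the shuffle, and a final hash aggregation merges the sub-aggregations afterwards.

```python
# Sketch of the pattern described above: partial (map-side) hash
# aggregation per partition before the Exchange, then a final hash
# aggregation merging the sub-aggregations after the shuffle.
from collections import defaultdict

def partial_aggregate(partition):
    # Runs on each partition before the shuffle: key -> partial count.
    acc = defaultdict(int)
    for key in partition:
        acc[key] += 1
    return dict(acc)

def final_aggregate(partials):
    # Runs after the Exchange: merge the per-partition partial counts.
    merged = defaultdict(int)
    for partial in partials:
        for key, count in partial.items():
            merged[key] += count
    return dict(merged)

partitions = [["a", "b", "a"], ["b", "b", "c"]]
partials = [partial_aggregate(p) for p in partitions]
result = final_aggregate(partials)
# result == {"a": 2, "b": 3, "c": 1}
```

Pre-aggregating per partition shrinks the data that crosses the network in the Exchange, which is why Spark does it before the shuffle.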


1 Answer

When you are using Spark 2.0, code generation is enabled by default, so most DataFrame queries can take advantage of the performance improvements automatically. There are some exceptions, such as Python UDFs, that may slow things down.

Code generation is one of the primary components of the Spark SQL engine's Catalyst Optimizer. In brief, the Catalyst Optimizer engine (1) analyzes the logical plan to resolve references, (2) optimizes the logical plan, (3) performs physical planning, and (4) generates code.
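As a toy sketch of phase (2) (not Catalyst's actual API), logical plan optimization works by applying tree-rewrite rules to the plan, such as constant folding, which collapses subexpressions built only from literals:

```python
# Toy sketch (not Catalyst's real API) of a logical-optimization rule:
# constant folding rewrites literal-only subtrees into single literals.

def constant_fold(node):
    # A node is either a leaf (column name or int literal)
    # or a tuple ("+", left, right).
    if isinstance(node, tuple):
        op, left, right = node
        left, right = constant_fold(left), constant_fold(right)
        if isinstance(left, int) and isinstance(right, int):
            return left + right  # fold "+" over two literals
        return (op, left, right)
    return node

# e.g. SELECT id + (1 + 2): the (1 + 2) subtree folds to the literal 3.
plan = ("+", "id", ("+", 1, 2))
assert constant_fold(plan) == ("+", "id", 3)
```

Catalyst applies batches of such rules repeatedly until the plan stops changing, then hands the optimized plan to physical planning and, finally, code generation.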


The blog posts are a great reference for all of this.

HTH!

Denny Lee answered Sep 22 '22