Disable Spark Catalyst optimizer

To give some background, I am trying to run the TPC-DS benchmark on Spark with and without Spark's Catalyst optimizer. For complicated queries on smaller datasets, we might be spending more time optimizing the plans than actually executing them. So I want to measure the impact of the optimizer on the overall execution time of a query.

Is there a way to disable some or all of the Spark Catalyst optimization rules?

asked May 10 '18 08:05

ajaymysore

2 Answers

I know it's not the exact answer, but it may help you.

This assumes your driver is not multithreaded. (A hint for optimization if Catalyst is slow? :) )

If you want to measure the time spent in Catalyst, just go to the Spark UI and check how long your executors are idle, or check the list of stages/jobs.

If one job starts at 15:30 and runs for 30 seconds, and the next one starts at 15:32, the 1:30 in between was probably spent by Catalyst optimizing (assuming no other driver-heavy work is done).

Or even better, log a timestamp before calling each action in Spark, then check how much time passes until the first task is actually sent to an executor.
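
A minimal sketch of that kind of measurement, done directly on the driver instead of via logs. It assumes a local SparkSession and a made-up placeholder query (`timeMs` and `CatalystTiming` are names invented for this example); it forces the optimized and physical plans through `queryExecution` before running the action, so each phase can be timed separately.

```scala
import org.apache.spark.sql.SparkSession

object CatalystTiming {
  // Runs `body` once and returns its result plus the elapsed wall-clock time in ms.
  def timeMs[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e6)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; point this at your own cluster instead.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("catalyst-timing")
      .getOrCreate()

    // Placeholder query; substitute the TPC-DS query you want to measure.
    // spark.sql parses and analyzes the query eagerly, but does not optimize it yet.
    val df = spark.sql(
      "SELECT id % 10 AS bucket, count(*) AS n FROM range(1000000) " +
        "WHERE id % 2 = 0 GROUP BY id % 10")

    // These are lazy vals, so the first access does the work and later uses are cached.
    val (_, optimizeMs) = timeMs(df.queryExecution.optimizedPlan) // logical optimization
    val (_, planMs) = timeMs(df.queryExecution.executedPlan)      // physical planning
    val (rows, runMs) = timeMs(df.collect())                      // actual execution

    println(f"optimizer: $optimizeMs%.1f ms, physical planning: $planMs%.1f ms, " +
      f"execution: $runMs%.1f ms, rows: ${rows.length}")
    spark.stop()
  }
}
```

The first query in a fresh JVM pays for class loading and JIT warm-up, so it is probably worth running each query a few times and ignoring the first measurement.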

answered Oct 12 '22 23:10

BiS

This ability was added in Spark 2.4.0 as part of SPARK-24802, through the spark.sql.optimizer.excludedRules configuration:

val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
    .doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
      "specified by their rule names and separated by comma. It is not guaranteed that all the " +
      "rules in this configuration will eventually be excluded, as some rules are necessary " +
      "for correctness. The optimizer will log the rules that have indeed been excluded.")
    .stringConf
    .createOptional

You can find the list of optimizer rules here.
But ideally we shouldn't disable rules wholesale, since most of them provide performance benefits. Instead, identify the rule that consumes the time, check whether it is actually useful for the query, and disable it only if it is not.
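
A rough sketch of how the setting could be used, assuming Spark 2.4+ and a local session. The two rule names are just examples I picked (check that they exist in your Spark version), and `ExcludeRulesDemo` / `excludedRulesValue` are made-up names. As the config doc above says, some rules may not be excluded even if listed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExcludeRulesDemo {
  // The config value is a comma-separated list of fully qualified rule names.
  def excludedRulesValue(rules: Seq[String]): String = rules.mkString(",")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("excluded-rules")
      .getOrCreate()

    // Build the DataFrame in a function: a DataFrame caches its plans,
    // so a fresh one is needed after the config changes.
    def query(): DataFrame =
      spark.sql("SELECT id + (1 + 2) AS x FROM range(10) WHERE id > 3")

    println("Default optimizer:")
    query().explain(true)

    // Example rules only; pick the ones you actually want to measure.
    val rules = Seq(
      "org.apache.spark.sql.catalyst.optimizer.ConstantFolding",
      "org.apache.spark.sql.catalyst.optimizer.CombineFilters")
    spark.conf.set("spark.sql.optimizer.excludedRules", excludedRulesValue(rules))

    println("With rules excluded:")
    query().explain(true)

    spark.stop()
  }
}
```

Comparing the "Optimized Logical Plan" sections of the two `explain(true)` outputs should show whether the exclusion took effect for your query.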

answered Oct 13 '22 01:10

DaRkMaN

Tags: optimization, apache-spark, apache-spark-sql, query-optimization, spark-dataframe
