Why Presto is faster than Spark SQL [closed]

1 Answer

In general, it is hard to say whether Presto is definitively faster or slower than Spark SQL. It really depends on the type of query you're executing, the environment, and the engine tuning parameters. However, from what I see in the industry (Uber and Netflix, for example), Presto is used for ad-hoc SQL analytics, whereas Spark is used for ETL/ML pipelines.

One possible explanation is that there is not much overhead in scheduling a query on Presto. The Presto coordinator is always up and waiting for queries. Spark, on the other hand, takes a lazy approach: it takes time for the driver to negotiate resources with the cluster manager, copy jars, and start processing.
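To make the scheduling-overhead argument concrete, here is a minimal back-of-the-envelope sketch. All numbers are hypothetical placeholders chosen to show the shape of the trade-off, not measurements of Presto or Spark, and the function names are made up for illustration.

```python
def end_to_end_latency(startup_s: float, execution_s: float) -> float:
    """Total wall-clock time seen by the user: fixed startup cost plus execution."""
    return startup_s + execution_s


def startup_share(startup_s: float, execution_s: float) -> float:
    """Fraction of the total latency spent before any data is processed."""
    return startup_s / end_to_end_latency(startup_s, execution_s)


# Hypothetical figures (assumptions, not benchmarks).
ALWAYS_ON_STARTUP_S = 0.5   # engine whose coordinator is already running
ON_DEMAND_STARTUP_S = 20.0  # engine that first acquires resources and ships jars

for label, execution_s in [("short ad-hoc query", 5.0), ("long ETL job", 3600.0)]:
    always_on = startup_share(ALWAYS_ON_STARTUP_S, execution_s)
    on_demand = startup_share(ON_DEMAND_STARTUP_S, execution_s)
    print(f"{label}: always-on {always_on:.1%}, on-demand {on_demand:.1%}")
```

Under these assumed numbers, a fixed startup cost dominates a short interactive query but is negligible for an hour-long batch job, which fits the ad-hoc versus ETL split described above.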

Another is that the Presto architecture is quite straightforward. It has a coordinator that does SQL parsing, planning, and scheduling, and a set of workers that execute the physical plan.


Spark core, on the other hand, has many more layers in between. Besides the stages that Presto also has, Spark SQL has to cope with the resiliency built into RDDs and do resource management and negotiation for its jobs.


Please also note that Spark SQL has a cost-based optimizer (CBO) that performs better on complex queries, while Presto (0.199) has a legacy rule-based optimizer. There is an ongoing effort to bring a CBO to Presto, which might potentially beat Spark SQL performance.
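As a rough illustration of why a cost-based optimizer can help on complex queries, here is a toy join-ordering sketch. It is not how either engine is implemented: the table names, row counts, selectivity, and cost model are all assumptions, and "keep the join order as written" stands in for a planner that has no statistics to reorder joins with.

```python
from itertools import permutations

# Hypothetical table statistics; a real optimizer would obtain these from a metastore.
ROW_COUNTS = {"orders": 1_000_000, "customers": 50_000, "countries": 200}
JOIN_SELECTIVITY = 0.001  # assumed fraction of row pairs that survive each join


def estimated_cost(join_order):
    """Sum of estimated intermediate result sizes for a left-deep join plan."""
    rows = ROW_COUNTS[join_order[0]]
    cost = 0.0
    for table in join_order[1:]:
        rows = rows * ROW_COUNTS[table] * JOIN_SELECTIVITY
        cost += rows
    return cost


def syntactic_order(tables):
    """No statistics: join the tables in the order the query lists them."""
    return tuple(tables)


def cost_based_order(tables):
    """With statistics: try every order and keep the cheapest estimate."""
    return min(permutations(tables), key=estimated_cost)


written = ("orders", "customers", "countries")
for name, planner in [("as written", syntactic_order), ("cost-based", cost_based_order)]:
    order = planner(written)
    print(f"{name}: {' JOIN '.join(order)} (estimated cost {estimated_cost(order):,.0f})")
```

In this toy model the cost-based planner joins the small tables first and avoids a large intermediate result. Exhaustive enumeration is only workable for a handful of tables; it is used here purely to keep the sketch short.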

175

answered Sep 22 '22 11:09

Sayat Satybald


Tags:

apache-spark-sql

presto

Long.zhao
