
Why Presto is faster than Spark SQL [closed]

Why is Presto faster than Spark SQL?

Also, what are the differences between Presto and Spark SQL in computing architecture and memory management?

Long.zhao asked Apr 25 '18 04:04



People also ask

Why is Presto fast?

Presto follows a “push” model, in which the stages of a SQL query run concurrently. A downstream stage receives data from its upstream stages as soon as it is produced, so intermediate data is passed directly between stages, which can make the query significantly faster.
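
As a rough illustration of the idea above, the toy Python sketch below contrasts pushing each row straight to the next stage with finishing one stage completely before the next one starts. This is not Presto code: the function names, the two example stages, and the single-threaded loops are assumptions made for illustration only (a real engine runs its stages concurrently on many workers).

```python
def run_pipelined(rows, stages):
    """Push each row through every stage before reading the next row.

    `stages` is a list of (name, fn) pairs; fn returns None to drop a row.
    Returns the output rows and the order in which stages handled values.
    """
    events, out = [], []
    for row in rows:
        value = row
        for name, fn in stages:
            events.append((name, value))
            value = fn(value)
            if value is None:  # row was filtered out, stop pushing it
                break
        else:
            out.append(value)
    return out, events


def run_materialized(rows, stages):
    """Finish a whole stage and store its output before starting the next."""
    events = []
    current = list(rows)
    for name, fn in stages:
        next_batch = []
        for value in current:
            events.append((name, value))
            result = fn(value)
            if result is not None:
                next_batch.append(result)
        current = next_batch  # intermediate result held between stages
    return current, events


# Hypothetical two-stage plan: keep even numbers, then multiply by 10.
EXAMPLE_STAGES = [
    ("filter", lambda x: x if x % 2 == 0 else None),
    ("project", lambda x: x * 10),
]
```

Both functions return the same rows for the same input. The difference is in the event order: with `run_pipelined`, the "project" stage already sees the first surviving row before the scan of the remaining rows has finished, while `run_materialized` only starts "project" after "filter" has processed everything.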

Is Presto better than Spark?

Presto is more commonly used to support interactive SQL queries. Its queries are usually analytical, but it can also run SQL-based ETL. Spark is more general in its applications and is often used for data transformation and machine learning workloads.

Why is Spark SQL so slow?

Spark SQL by default uses 200 partitions when shuffling data for joins and aggregations (the spark.sql.shuffle.partitions setting). 200 partitions might be too many if a user is working with small data, which adds per-task overhead and slows the query down. Conversely, 200 partitions might be too few if the data is big.
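
To make the trade-off concrete, here is a minimal Python sketch that estimates a partition count from the amount of shuffled data. The helper names and the 128 MiB target size are assumptions chosen for illustration, not values prescribed by Spark; the only Spark fact used is the default of 200 for spark.sql.shuffle.partitions.

```python
import math

# Spark SQL's default value for spark.sql.shuffle.partitions.
DEFAULT_SHUFFLE_PARTITIONS = 200

MIB = 1024 ** 2


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MIB):
    """Hypothetical helper: choose a partition count so that each
    partition holds roughly `target_partition_bytes` of data.

    The 128 MiB target is an assumed rule of thumb, not a Spark constant.
    """
    if shuffle_bytes <= 0:
        return 1
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))


def bytes_per_partition(shuffle_bytes, partitions=DEFAULT_SHUFFLE_PARTITIONS):
    """Average partition size for a given partition count."""
    return shuffle_bytes / partitions


# In a PySpark session the chosen value could then be applied with, e.g.:
#   spark.conf.set("spark.sql.shuffle.partitions", str(n))
```

Under these assumptions, 50 MiB of shuffled data fits in a single partition, so the default of 200 would create many tiny tasks, while 1 TiB split 200 ways leaves more than 5 GiB per partition.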

Why is Presto faster than Hive?

Hive is optimized for query throughput, while Presto is optimized for latency. Presto limits the amount of memory that each task in a query can use, so if a query needs more memory than that limit, the query simply fails.


1 Answer

In general, it is hard to say whether Presto is definitely faster or slower than Spark SQL. It really depends on the type of query you’re executing, the environment, and the engine tuning parameters. However, from what I see in the industry (Uber and Netflix, for example), Presto is used for ad hoc SQL analytics, whereas Spark is used for ETL/ML pipelines.

One possible explanation is that Presto has little overhead for scheduling a query. The Presto coordinator is always up and waiting for queries. Spark, on the other hand, takes a lazy approach: it takes time for the driver to negotiate resources with the cluster manager, copy jars, and start processing.
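
A back-of-the-envelope Python sketch shows why this fixed startup cost matters most for short, interactive queries. The function names are hypothetical, and any numbers plugged in are illustrative rather than measurements of Presto or Spark.

```python
def end_to_end_seconds(startup_s, execution_s):
    """Total latency seen by the user: fixed startup cost plus query work."""
    return startup_s + execution_s


def startup_share(startup_s, execution_s):
    """Fraction of the total latency spent before any data is processed."""
    total = end_to_end_seconds(startup_s, execution_s)
    return startup_s / total if total > 0 else 0.0
```

For example, with an assumed 20 seconds of startup, a query that needs 5 seconds of actual work spends 80% of its latency on startup (`startup_share(20, 5)`), while a batch job that runs for 1980 seconds spends only about 1% (`startup_share(20, 1980)`). An always-on coordinator corresponds to a startup term close to zero.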

Another is that the Presto architecture is quite straightforward. It has a coordinator that does SQL parsing, planning, and scheduling, and a set of workers that execute the physical plan.


On the other hand, Spark core has many more layers in between. Besides the stages that Presto also has, Spark SQL has to cope with the resiliency built into RDDs and do resource management and negotiation for its jobs.


Please also note that Spark SQL has a cost-based optimizer (CBO) that performs better on complex queries, while Presto (0.199) has a legacy rule-based optimizer. There is an ongoing effort to bring a CBO to Presto, which could potentially let it beat Spark SQL's performance.

Sayat Satybald answered Sep 22 '22 11:09
