Why is Presto faster than Spark SQL?
Besides what is the difference between Presto and Spark SQL in computing architectures and memory management?
Presto follows the “push” model, which processes a SQL query using multiple stages running concurrently. An upstream stage receives data from its downstream stages, so the intermediate data can be passed directly, thus making the query significantly faster.
Presto is more commonly used to support interactive SQL queries. Queries are usually analytical but can perform SQL-based ETL. Spark is more general in its applications, often used for data transformation and Machine Learning workloads.
Spark by default uses 200 partitions when doing transformations. The 200 partitions might be too large if a user is working with small data, hence it can slow down the query. Conversely, the 200 partitions might be too small if the data is big.
Hive is optimized for query throughput, while Presto is optimized for latency. Presto has a limitation on the maximum amount of memory that each task in a query can store, so if a query requires a large amount of memory, the query simply fails.
In general, it is hard to say if Presto is definitely faster or slower than Spark SQL. It really depends on the type of query you’re executing, environment and engine tuning parameters. However, what I see in the industry(Uber, Neflix examples) Presto is used as ad-hock SQL analytics whereas Spark for ETL/ML pipelines.
One possible explanation, there is no much overhead for scheduling a query for Presto. Presto coordinator is always up and waits for query. On the other hand, Spark is doing lazy approach. It takes time for the driver to negotiate with the cluster manager the resources, copy jars and start processing.
Another one that Presto architecture quite straightforward. It has a coordinator that does SQL parsing, planning, scheduling and a set of workers that execute a physical plan.
On the other hand, Spark core has much more layers in between. Besides stages that Presto has, Spark SQL has to cope with a resiliency build into RDD, do resource management and negotiation for the jobs.
Please also note that Spark SQL has Cost-Based-Optimizer that performs better on complex queries. While Presto(0.199) has a legacy ruled based optimizer. There is ongoing effort to bring CBO to Presto which might potentially beat Spark SQL performance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With