Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Presto performance tuning, queries are much slower when performed in parallel

I have a presto cluster configured with 12 workers that is being queried by Java applications. The cluster is capable of performing 30 concurrent requests (if there are more, they are queued).

The applications might send around 80-100 distinct queries, which I expect to be handled by cluster.

Problem: When queries are performed sequentially they complete significantly faster than when they are performed in parallel.

For instance, if I run 100 queries sequentially each of them takes 1-12 seconds to complete and they all are completed in around 2 minutes. But if I start all of them in parallel it takes around 8-12 minutes to complete them all. At corner cases it takes up to 30 minutes.

If I look on the presto console I see that most of the queries are blocked and only 1-3 are in fact in Running state.

Unfortunately I can't post any of the queries. They usually access different schemas (up to 6 in one query), they are full of joins and nested queries. At the same time most of them are written following presto best practices.

Question: How can I improve performance? At least what areas should I investigate to find out the root cause?

Here are some metrics for one of the slowest queries (may be the numbers will say something to you).

Resource Utilization Summary

CPU Time            8.42m
Scheduled Time      26.04m
Blocked Time        4.77d
Input Rows          298M
Input Data          9.94GB
Raw Input Rows      323M
Raw Input Data      4.34GB
Peak Memory         10.18GB
Memory Pool         reserved
Cumulative Memory   181G seconds

Timeline

Parallelism         477
Scheduled Time/s    1.47K
Input Rows/s        281K
Input Bytes/s       9.60MB
Memory Utilization  0B
like image 901
Sasha Shpota Avatar asked Jul 12 '18 09:07

Sasha Shpota


People also ask

Is Presto slow?

Presto – extremely fast (168–177 BPM)

Why is Presto faster?

Efficiency and speed are important for query performance so Presto has a number of design features to maximize speed such as in-memory pipelined execution, a distributed scale-out architecture, and massively parallel processing (MPP) design.


1 Answers

It seems like I figured out the issue myself.

Presto is a distributed SQL query engine. And the key word here is distributed. It guarantees that if you run a query it is efficiently distributed among workers and performed with high speed.

Performing parallel queries and expecting that Presto will figure out how to efficiently parallel them is most likely a misuse. It is more like relational database approach which unfortunately doesn't work in Presto.

like image 146
Sasha Shpota Avatar answered Oct 01 '22 02:10

Sasha Shpota