Presto performance tuning, queries are much slower when performed in parallel

Tags: performance, sql, sqlperformance, presto

I have a Presto cluster with 12 workers that is queried by Java applications. The cluster can handle 30 concurrent queries (any beyond that are queued).

The applications might send around 80-100 distinct queries, which I expect the cluster to handle.

Problem: When the queries run sequentially, they complete significantly faster than when they run in parallel.

For instance, if I run 100 queries sequentially, each takes 1-12 seconds and all of them complete in around 2 minutes. But if I start all of them in parallel, it takes around 8-12 minutes for all of them to complete. In extreme cases it takes up to 30 minutes.
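One way to narrow down where throughput starts to degrade is to time the same batch at several concurrency levels instead of only comparing "one at a time" against "all at once". The sketch below is a hypothetical harness, not part of the original setup: `ConcurrencySweep`, `measureMillis`, and `fakeQuery` are made-up names, and the sleeping task is only a stand-in for whatever call the application uses to run a real query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConcurrencySweep {

    /** Runs every task on a pool of the given size and returns the wall-clock time in milliseconds. */
    static long measureMillis(List<Runnable> tasks, int concurrency) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        long start = System.nanoTime();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (Runnable task : tasks) {
                futures.add(pool.submit(task));
            }
            // Wait for the whole batch before stopping the clock.
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Hypothetical stand-in for a real query: it just sleeps for the given time. */
    static Runnable fakeQuery(long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    public static void main(String[] args) {
        List<Runnable> tasks = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            tasks.add(fakeQuery(50));
        }
        // Sweep a few pool sizes; with real queries, look for the level where total time stops improving.
        for (int concurrency : new int[] {1, 5, 10, 20}) {
            System.out.println("concurrency=" + concurrency
                    + " elapsedMs=" + measureMillis(tasks, concurrency));
        }
    }
}
```

With sleeping tasks the total time simply shrinks as the pool grows. With real queries against a contended cluster, the point where the curve flattens or turns upward is the interesting one.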

If I look at the Presto console, I see that most of the queries are blocked and only 1-3 are actually in the Running state.

Unfortunately, I can't post any of the queries. They usually access several schemas (up to 6 in one query) and are full of joins and nested queries. At the same time, most of them are written following Presto best practices.

Question: How can I improve performance? Or at least, which areas should I investigate to find the root cause?

Here are some metrics for one of the slowest queries (maybe the numbers will tell you something).

Resource Utilization Summary

CPU Time            8.42m
Scheduled Time      26.04m
Blocked Time        4.77d
Input Rows          298M
Input Data          9.94GB
Raw Input Rows      323M
Raw Input Data      4.34GB
Peak Memory         10.18GB
Memory Pool         reserved
Cumulative Memory   181G seconds

Timeline

Parallelism         477
Scheduled Time/s    1.47K
Input Rows/s        281K
Input Bytes/s       9.60MB
Memory Utilization  0B
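To put the summary above in proportion, the durations can be converted to a common unit and compared. The following is a minimal sketch: `QueryStatsRatios` is a hypothetical helper, and it assumes the console's duration suffixes are `s`, `m`, `h`, and `d`. With the figures above, blocked time comes out at roughly 264 times the scheduled time, and CPU time at roughly a third of the scheduled time.

```java
final class QueryStatsRatios {

    /** Converts a console-style duration such as "8.42m" or "4.77d" to seconds. */
    static double toSeconds(String value) {
        String v = value.trim();
        char unit = v.charAt(v.length() - 1);
        double amount = Double.parseDouble(v.substring(0, v.length() - 1));
        switch (unit) {
            case 's': return amount;
            case 'm': return amount * 60;
            case 'h': return amount * 3600;
            case 'd': return amount * 86400;
            default: throw new IllegalArgumentException("Unknown unit in: " + value);
        }
    }

    /** Ratio of two durations given in console notation. */
    static double ratio(String numerator, String denominator) {
        return toSeconds(numerator) / toSeconds(denominator);
    }

    public static void main(String[] args) {
        // Values copied from the Resource Utilization Summary above.
        System.out.println("blocked / scheduled = " + ratio("4.77d", "26.04m"));
        System.out.println("cpu / scheduled     = " + ratio("8.42m", "26.04m"));
    }
}
```

Blocked Time is presumably accumulated across all of the query's parallel threads, which would explain how it can reach days for a query that finishes in minutes. Even so, a ratio this large suggests the query spends most of its time waiting instead of computing.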

asked Jul 12 '18 09:07

Sasha Shpota

1 Answer

It seems like I figured out the issue myself.

Presto is a distributed SQL query engine, and the key word here is distributed. It is designed so that a single query is spread efficiently across the workers and executed quickly.

Firing many queries in parallel and expecting Presto to figure out how to parallelize them efficiently is most likely a misuse. That is more of a relational database approach, which unfortunately doesn't work in Presto.
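If that reading is right, one practical consequence is to cap how many queries the application has in flight instead of submitting the whole batch at once. The sketch below shows one way this could be done on the client side with a fixed-size thread pool. `BoundedQueryRunner` and its methods are hypothetical names, each `Callable` stands in for whatever code actually runs a query, and the right cap for a given cluster is something to measure, not a known value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedQueryRunner {
    private final int maxConcurrent;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peakInFlight = new AtomicInteger();

    BoundedQueryRunner(int maxConcurrent) {
        if (maxConcurrent < 1) {
            throw new IllegalArgumentException("maxConcurrent must be at least 1");
        }
        this.maxConcurrent = maxConcurrent;
    }

    /** Runs all queries with at most maxConcurrent in flight; results keep the input order. */
    <T> List<T> runAll(List<Callable<T>> queries) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> query : queries) {
                futures.add(pool.submit(() -> {
                    // Track how many queries are running at once, for diagnostics.
                    int now = inFlight.incrementAndGet();
                    peakInFlight.accumulateAndGet(now, Math::max);
                    try {
                        return query.call();
                    } finally {
                        inFlight.decrementAndGet();
                    }
                }));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for queries", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Highest number of queries observed running at the same time. */
    int peakInFlight() {
        return peakInFlight.get();
    }
}
```

A caller would build one `Callable` per query, pass the list to `runAll`, and try a few different caps to see which gives the shortest total time for the batch.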

answered Oct 01 '22 02:10

Sasha Shpota
