Main causes for leader node to be at high CPU

Tags:

I often see the leader node of our Redshift cluster peak at 100% CPU. I've identified one possible cause for this: many concurrent queries, and so many execution plans for the leader to calculate. This hypothesis seems very likely as the times we have the most queries coming seem to be the same as the ones we see the leader at 100%.

To fix this at best, we are wondering: are there other main possible causes for a high CPU on the leader?

(I'm precising the situation is when only the leader node is at high CPU and the workers seem fine)

453

asked Oct 14 '19 08:10

totooooo

1 Answers

The Redshift leader node is the same size and class of compute as the compute nodes. Typically this means that the leader is over provisioned for the role it plays but since its role is so important and impactful if things slows down, it is good that it is over provisioned. The leader needs to compile and optimized the queries and perform final steps in queries (final sort for example). It communicates with the session clients and handles all their requests. If the leader becomes overloaded all these activities slow down creating significant performance issues. It is not good that your leader is hitting 100% CPU often enough for you to notice. I bet the seems sluggish when this happens.

There are a number of ways I've seen "leader abuse" and it usually becomes a problem when bad patterns are copied between users. In no particular order:

Large data literals in queries (INSERT ... VALUES ...). This puts your data through the query compiler on the leader node. This is not what it is design to do and is very expensive for the leader. Use the COPY command to bring data into the cluster. (Just bad, don't do this)
Over use of COMMIT. A commits cause an update to the coherent state of the database and needs to run through the "commit queue" and creates work for the leader and the compute nodes. Having COMMITs every other statement can cause this queue to back up and work to generally back up.
Too many slots defined in the WLM. Redshift can typically only efficiently run between 1 and 2 dozen queries at once. Setting the total slot count very high (like 50) can lead to very inefficient operation and high CPU loads. Depending on workload this can show up for compute or occasionally the lead node.
Large data output through SELECT statements. SELECTs return data but when this data is many GBs in size the management of this data movement (and sorting) is done by the leader node. If large amounts of data need to be extracted from Redshift it should be done with an UNLOAD statement.
Overuse of large cursors. Cursors can be an important tool and needed for many BI tools but cursors are located on the leader and overuse can lead to reduced leader attention on other tasks.
Many / large UNLOADs with parallel off. UNLOADs generally come from the compute nodes straight to S3 but with "parallel off" all the data is routed to the leader node where it is combined (sorted) and sent to S3.

While none of the above of problems in and of themselves, it is when these are overused, used in ways they are not intended, or all at once that the leader starts to be impacted. It also comes down to what you intend to do with your cluster - if it support BI tools then you may have a lot of cursors but this load on the leader is part of the cluster's intent. Issue often arise when the cluster's intent is to all things to everybody.

If your workload for Redshift is leader function heavy and you are efficiently using the leader node (no large literals, using COPY and UNLOAD, etc.) then high leader workload is what you want. You're getting the most out of the critical resource. However, most use Redshift to perform analytics on large data which is the function of the compute nodes. A highly loaded leader can detract significantly from this mission and needs to be addressed.

Another way that leader can get stressed is when clusters are configured with many smaller node types instead of fewer bigger nodes. Since the leader is the same size as the compute nodes many smaller nodes means you have a small leader doing the work. Something to consider but I'd make sure you don't have unneeded leader node stressers before investing in a resize.

129

answered Oct 18 '22 06:10

Bill Weiner

Related questions
                            
                                How to use Terraform to execute SQL script on RDS MySQL?
                            
                                Why well-designed DynamoDB application require only one table?
                            
                                Optional identifiers in cognito hosted UI do not work
                            
                                Using hashed assets with CDNs on a Rails rolling deployment
                            
                                How Does AWS CloudFront Detect Mobile Devices
                            
                                We're getting an error when users are trying to sign up and create new accounts/ albums using AWS. See terminal message below
                            
                                AWS Glue job hangs when calling the AWS Glue client API using boto3 from the context of a running AWS Glue Job?
                            
                                "Missing credentials in config" with aws-amplify Storage
                            
                                Why is my lambda not able to talk to elasticache?
                            
                                Process for uploading image to s3 with AWS Appsync || iOS image uploading with Appsync
                            
                                AWS: Placing an Application Load Balance in front of Elasticsearch Managed Service
                            
                                Registering AWSS3 configuration creating 5000+ memory leaks
                            
                                how to add AWS API gateway with application load balancer for ECS?
                            
                                AWS HTTP API - same request but different response in Python Requests vs Dart HTTP
                            
                                Node module-alias Error: Cannot find module '@src/utils/constants'
                            
                                GlacierSelectNotAvailable: Glacier select retrievals are currently not available, please try again later
                            
                                How to get Glue development endpoint URL inside CloudFormation template?
                            
                                Equivalent of ItemUtils.toAttributeValue in DynamoDB JDK 2.X?
                            
                                Install Multiple Package using Sagemaker Life Cycle configuration file
                            
                                How to use S3 SSE C (Server Side Encryption with Client Provided Keys) with Ruby

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Main causes for leader node to be at high CPU

Tags:

amazon-web-services

amazon-redshift

totooooo

People also ask

1 Answers

Bill Weiner

Recent Activity

Donate For Us