
Equal connection distribution is not happening in Aurora autoscaled instances

We are running a REST API based Spring Boot application that uses AWS Aurora as its database. The application connects to read-only Aurora MySQL RDS instances, and we are load testing it. Initially we have one database instance, with autoscaling in place that is triggered on high CPU. We expect that if we get some throughput X with one DB instance, we should get approximately 1.8X when autoscaling happens, with connections distributed equally across the existing and the newly created database instances. That is not happening. Instead, DB connections go up and down erratically on both database instances, so the load is not distributed equally and we do not get the desired throughput. Sometimes one database runs at 100% CPU while the other is still at 20%, and after a few minutes it is reversed. Below is the database connection configuration:

Driver - com.mysql.jdbc.Driver
Maximum active connections=100
Max age = 300000
Initial pool size = 10

Tomcat jdbc pool is used for connection pooling

NOTE:
1) We have also disabled JVM network DNS caching.
2) We also tried refreshing the database connections every 5 minutes, even the active ones.
3) We have tried everything suggested by AWS, but nothing is working.
4) We have even written a Lambda function to update Route 53 when a new DB instance comes up, to avoid cluster endpoint caching, but we still see the same issue.

Can anyone please advise on the best practice for this? Currently we cannot take this into production.

Akash Bindal asked Jul 26 '18 11:07


1 Answer

This is not a great answer, but since you haven't gotten any replies yet, here are some thoughts.

1) The behavior you are seeing replicates the bad routing logic of load balancers
This is no surprise to you, but it used to be much more common with small web server deployments, especially with long-running queries. With connection pooling, you mirror this situation.

2) Taking this assumption forward, we need to guess how Amazon chose to balance traffic to read-only replicas.
Even in their white paper, they don't mention how they do the routing: https://www.allthingsdistributed.com/files/p1041-verbitski.pdf

Likely options are route53 or an NLB. My best guess would be that they are using an NLB. NLBs became available to us only in Q3 2017 and Aurora was 2 years before, but it still is a reasonable guess. NLBs would let us balance based on least connections (far better than round robin).

3) Validating assumptions
If Route53 is being used, then we should be able to use DNS to find out. I did a dig against the read-only endpoint and got this answer:

dig +nocmd +noall +answer zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com. 1 IN CNAME zzz-0.yyy.us-east-1.rds.amazonaws.com.
zzz-0.yyy.us-east-1.rds.amazonaws.com. 5 IN A 10.32.8.33

I did it again and got a different answer.

dig +nocmd +noall +answer zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com
zzz-databasecluster-xxx.cluster-ro-yyy.us-east-1.rds.amazonaws.com. 1 IN CNAME zzz-2.yyy.us-east-1.rds.amazonaws.com.
zzz-2.yyy.us-east-1.rds.amazonaws.com. 5 IN A 10.32.7.97

What you can see is that the read-only endpoint is giving me a CNAME result that points to an individual replica instance (zzz-0 the first time, zzz-2 the second).

zzz is the name of my cluster, xxx came from my CloudFormation stack creation, and yyy comes from Amazon.

Note: zzz-0 and zzz-2 are the two read-only replicas.

What we can see here is that Route53 (DNS) is doing our load balancing.

4) Route53 Load Balancing
They are likely setting up Route53 with round robin across all healthy read-only replicas.
The TTL is likely 5s. Unhealthy nodes will get removed, but there is no balancing based on load (connection count or CPU).

5) Ramifications
A) Using the read-only endpoint can only balance traffic away from unhealthy instances.
B) DB pools will keep connections for a long time, which means that new read replicas won't be touched.

If we have a small number of servers, we will be unbalanced, and there is not much we can do about that.

6) Thoughts on what you can do
A) Verify yourself with dig that you are getting correct DNS resolution that keeps rotating between replicas every 5s.
If you don’t, this is something you need to fix
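Since the application resolves the endpoint from inside the JVM rather than through dig, it may be worth repeating the same check from Java. The following is only a sketch: the class name `ReaderEndpointProbe` and its methods are hypothetical, and the resolver is passed in as a function so the counting logic can be exercised without a network. It sets the standard `networkaddress.cache.ttl` security property to disable JVM DNS caching, then counts how often each address comes back.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical helper, not part of any AWS or Spring API.
final class ReaderEndpointProbe {

    /** Turns off JVM-level DNS caching. Should run before the JVM's first lookup. */
    static void disableJvmDnsCache() {
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    /** Resolves the host once and returns the address text, or "unresolved" on failure. */
    static String resolveOnce(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return "unresolved";
        }
    }

    /**
     * Calls the resolver `attempts` times, sleeping `pauseMillis` between calls,
     * and counts how many times each answer was returned.
     */
    static Map<String, Integer> tally(String host, int attempts, long pauseMillis,
                                      Function<String, String> resolver) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < attempts; i++) {
            counts.merge(resolver.apply(host), 1, Integer::sum);
            if (pauseMillis > 0 && i < attempts - 1) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status and stop early
                    break;
                }
            }
        }
        return counts;
    }
}
```

Calling `ReaderEndpointProbe.disableJvmDnsCache()` at startup and then `ReaderEndpointProbe.tally(readerEndpoint, 12, 5000, ReaderEndpointProbe::resolveOnce)` should show more than one address if the rotation is visible to the JVM. If only one address ever appears, something between the JVM and Route53 is probably still caching.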

B) Periodically recycle DB Clients
New replicas will get used, and while you will still be unbalanced, this helps because the distribution keeps changing. What is critical, though, is that you MUST NOT have all your clients recycle at the same time. Otherwise, you run the risk of all of them landing on the same replica at the same time. I would suggest a random TTL per client (within a min/max range).
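A random TTL per client could be as small as the sketch below. The class name `JitteredMaxAge` is hypothetical, and the assumption is that each application instance picks one value at startup and passes it to its pool's maximum connection age setting (for the Tomcat JDBC pool in the question, the `maxAge` property, in milliseconds).

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: picks a per-client connection max age so that
// different clients do not recycle their pools in lockstep.
final class JitteredMaxAge {

    /** Returns a value chosen uniformly at random from [minMillis, maxMillis]. */
    static long pick(long minMillis, long maxMillis) {
        if (minMillis <= 0 || maxMillis < minMillis) {
            throw new IllegalArgumentException("need 0 < minMillis <= maxMillis");
        }
        // nextLong's upper bound is exclusive, so add 1 to include maxMillis.
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }
}
```

For example, `JitteredMaxAge.pick(240_000, 360_000)` spreads clients around the 300000 ms max age from the question. The range itself is an illustrative guess, not a recommendation.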

C) Manage it yourself

Summary: When you connect, connect directly to the read replica with least connection/lowest CPU.

How you do this is not entirely simple. I would suggest a Lambda function that keeps this connection string in a queryable location and updates it at some frequency. I would say the frequency of updating the preferred DB should be 1/10 of the frequency at which you recycle the DB connections. You could add logic so that if the DBs are running similarly, you give out the read-only endpoint, and only give out an explicit instance when there is significant inequity.
I would caution that when a new instance comes up, you want to be careful not to flood it.
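The selection logic described in C) might look roughly like this. It is a sketch under assumptions: `ReplicaChooser`, `ReplicaStats` and the `cpuGapThreshold` parameter are hypothetical names, and how the CPU and connection numbers are collected (CloudWatch, for instance) is left out entirely.

```java
import java.util.List;

// Hypothetical sketch of "give the read-only endpoint unless the replicas are
// significantly unequal, in which case give the least-loaded replica".
final class ReplicaChooser {

    /** Snapshot of one replica's load; the values would come from your own monitoring. */
    static final class ReplicaStats {
        final String endpoint;
        final double cpuPercent;
        final int connections;

        ReplicaStats(String endpoint, double cpuPercent, int connections) {
            this.endpoint = endpoint;
            this.cpuPercent = cpuPercent;
            this.connections = connections;
        }
    }

    /**
     * Returns the endpoint of the replica with the lowest CPU (ties broken by fewer
     * connections) when the CPU gap between the busiest and idlest replica is at
     * least cpuGapThreshold percentage points; otherwise returns readerEndpoint.
     */
    static String choose(String readerEndpoint, List<ReplicaStats> replicas, double cpuGapThreshold) {
        if (replicas.size() < 2) {
            return readerEndpoint; // nothing to balance between
        }
        ReplicaStats min = replicas.get(0);
        ReplicaStats max = replicas.get(0);
        for (ReplicaStats r : replicas) {
            boolean lowerCpu = r.cpuPercent < min.cpuPercent;
            boolean sameCpuFewerConns = r.cpuPercent == min.cpuPercent && r.connections < min.connections;
            if (lowerCpu || sameCpuFewerConns) {
                min = r;
            }
            if (r.cpuPercent > max.cpuPercent) {
                max = r;
            }
        }
        return (max.cpuPercent - min.cpuPercent >= cpuGapThreshold) ? min.endpoint : readerEndpoint;
    }
}
```

With the 100% versus 20% CPU split from the question and a threshold of, say, 30 points, this returns the idle replica's endpoint. With replicas at 50% and 55% it falls back to the read-only endpoint. A freshly started replica will always look idle, so in practice some cap on how many clients are sent to it at once is probably needed.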

D) Increase number of clients or number of read only copies

Both of these would decrease the chance that two boxes would get significant differences.
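For the client-count half of that claim, a rough back-of-the-envelope model helps. Assume, as a simplification that is not from AWS documentation, that there are two replicas and each connection lands on one of them independently with probability 1/2. The share of connections one replica receives then has a standard deviation of 0.5 / sqrt(n). The class name `ImbalanceEstimate` is hypothetical.

```java
// Hypothetical helper illustrating how the expected spread shrinks as the
// number of independently placed connections grows (two replicas assumed).
final class ImbalanceEstimate {

    /**
     * Standard deviation of the fraction of connections that one of two replicas
     * receives, if each of `connections` picks a replica with probability 1/2.
     * This is the binomial result sqrt(p * (1 - p) / n) with p = 0.5.
     */
    static double shareStdDev(int connections) {
        if (connections <= 0) {
            throw new IllegalArgumentException("connections must be positive");
        }
        return 0.5 / Math.sqrt(connections);
    }
}
```

Under this model, 10 connections give a spread of about 0.16 (16 percentage points around the 50% share), 100 connections give 0.05, and 1000 give about 0.016. Real pools tend to do worse than this, because connections opened within one DNS TTL typically resolve to the same replica and are therefore not independent.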

sfblackl answered Oct 30 '22 16:10