Occasional 'temporary failure in name resolution' while connecting to AWS Aurora cluster

I am running an Amazon Web Services RDS Aurora 5.6 database cluster. There are a couple of Lambdas talking to these database instances, all written in Python. Everything was running well, but then suddenly, a couple of days ago, the Python code started occasionally throwing the following error:

[ERROR] InterfaceError: 2003: Can't connect to MySQL server on 'CLUSTER-DOMAIN:3306' (-3 Temporary failure in name resolution)

This happens in roughly 1 in every 1,000 or so new connections. What is interesting is that I haven't touched this whole service in the last couple of days (since it started happening). All Lambdas use the official MySQL connector client and connect on every initialization with the following snippet:

import mysql.connector as mysql
import os

connection = mysql.connect(user=os.environ['DATABASE_USER'],
                         password=os.environ['DATABASE_PASSWORD'],
                         database=os.environ['DATABASE_NAME'],
                         host=os.environ['DATABASE_HOST'],
                         autocommit=True)

To rule out that this is a problem in the Python MySQL client, I added the following to resolve the host myself:

import os
import socket

host = socket.gethostbyname(os.environ['DATABASE_HOST'])

Also here I sometimes get the following error:

[ERROR] gaierror: [Errno -2] Name or service not known

Now I suspect this has something to do with DNS, but since I'm just using the cluster endpoint there is not much I can do about that. What is interesting is that I also recently encountered exactly the same problem in a different region, with the same setup (an Aurora 5.6 cluster with Python Lambdas connecting to it), and the same thing happens there.

I've tried restarting all the machines in the cluster, but the problem still seems to occur. Is this really a DNS issue? What can I do to stop this from happening?

Asked Oct 01 '19 by Jorden van Breemen


2 Answers

AWS Support have told me that this error is likely to be caused by a traffic quota in AWS's VPCs.

According to their documentation on DNS Quotas:

Each Amazon EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface. This quota cannot be increased. The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use. For more information and recommendations for a scalable DNS architecture, see the Hybrid Cloud DNS Solutions for Amazon VPC whitepaper.

It's important to note that the metric we're looking at here is packets per second, per ENI. Why does that matter? It may not be immediately obvious, but although the actual number of packets varies, a single DNS query typically involves multiple packets.

While these packets cannot be seen in VPC flow logs, upon reviewing my own packet captures, I can see some resolutions consisting of about 4 packets.

Unfortunately, I can't say much about the whitepaper; at this stage, I'm not really considering the implementation of a hybrid DNS service as a "good" solution.

Solutions

I'm looking into ways to alleviate the risk of this error occurring, and to limit its impact when it does occur. As I see it, there are a number of options to achieve this:

  1. Force Lambda functions to resolve the Aurora cluster's DNS before doing anything else, use the resolved private IP address for the connection, and handle resolution failures with an exponential back-off. To minimise the cost of waiting for retries, I've set a total timeout of 5 seconds for DNS resolution; this number includes all back-off wait time. (There's a sketch of this after the list.)
  2. Making many short-lived connections comes with a potentially costly overhead, even if you're closing the connection. Consider using connection pooling on the client side; it is a common misconception that Aurora's connection pooling is sufficient to handle the overhead of many short-lived connections.
  3. Try not to rely on DNS where possible. Aurora automatically handles failover and promotion/demotion of instances, so it's important to know that you're always connected to the "right" (or write, in some cases :P) instance. Since updates to the Aurora cluster's DNS name can take time to propagate, even with its 5-second TTL, it might be better to make use of the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table, in which MySQL exposes "near-real-time" metadata about DB instances. Note that the table "contains cluster-wide metadata". If you can't be bothered, have a look at option 4. (There's a sketch of this after the list, too.)
  4. Use a smart driver, which:

    is a database driver or connector with the ability to read DB cluster topology from the metadata table. It can route new connections to individual instance endpoints without relying on high-level cluster endpoints. A smart driver is also typically capable of load balancing read-only connections across the available Aurora Replicas in a round-robin fashion.
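
For option 1, here's a minimal sketch in the same mysql.connector setup as the question, assuming the same DATABASE_* environment variables; the back-off parameters are illustrative rather than tuned values:

import os
import socket
import time

import mysql.connector as mysql

def resolve_with_backoff(hostname, total_timeout=5.0, initial_delay=0.1):
    # Retry DNS resolution with exponential back-off, capping the total
    # time spent (attempts plus waits) at roughly total_timeout seconds.
    deadline = time.monotonic() + total_timeout
    delay = initial_delay
    while True:
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror:
            if time.monotonic() + delay > deadline:
                raise  # time budget exhausted, surface the resolution failure
            time.sleep(delay)
            delay *= 2  # double the wait before the next attempt

host_ip = resolve_with_backoff(os.environ['DATABASE_HOST'])
connection = mysql.connect(user=os.environ['DATABASE_USER'],
                           password=os.environ['DATABASE_PASSWORD'],
                           database=os.environ['DATABASE_NAME'],
                           host=host_ip,
                           autocommit=True)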

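For option 3, a rough sketch of reading the cluster topology from that metadata table, reusing the connection from the snippet above; the column names (server_id, session_id, last_update_timestamp) and the 'MASTER_SESSION_ID' writer marker are as documented for Aurora MySQL, and the freshness filter is just one way of skipping stale rows left behind by removed instances:

cursor = connection.cursor()
cursor.execute("""
    SELECT server_id, session_id, last_update_timestamp
    FROM information_schema.replica_host_status
    WHERE last_update_timestamp > NOW() - INTERVAL 3 MINUTE
""")
for server_id, session_id, last_update in cursor.fetchall():
    # The writer reports the literal session id 'MASTER_SESSION_ID';
    # every other row is an Aurora Replica (reader).
    role = 'writer' if session_id == 'MASTER_SESSION_ID' else 'reader'
    print(server_id, role, last_update)
cursor.close()
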
Not solutions

Initially, I thought it might be a good idea to create a CNAME which points to the cluster, but now I'm not so sure that caching Aurora DNS query results is wise. There are a few reasons for this, which are discussed in varying levels of detail in The Aurora Connection Management Handbook:

  • Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora Replicas. Currently, Aurora DNS zones use a short Time-To-Live (TTL) of 5 seconds. Ensure that your network and client configurations don’t further increase the DNS cache TTL.

  • Aurora's cluster and reader endpoints abstract the role changes (primary instance promotion/demotion) and topology changes (addition and removal of instances) occurring in the DB cluster.

I hope this helps!

Answered Oct 11 '22 by mikey


I had the same error with an instance (and ruled out the DNS lookup limit). After some time I stumbled on an AWS support thread indicating that it could be a hardware problem.

The physical underlying host of your instance (i-3d124c6d) looks to have intermittently been having issues, some of which would have definitely caused service interruption.

Could you try stopping and starting this instance? Doing so will cause it to be brought up on new underlying hardware and then we could utilize your pingdom service to verify if further issues arise.

from: https://forums.aws.amazon.com/thread.jspa?threadID=171805.

Stopping and restarting the instance resolved the issue for me.

Answered Oct 11 '22 by Grey Vugrin