I am running an Amazon Web Services RDS Aurora 5.6 database cluster. There are a couple of lambda's talking to these database instances, all written in python. Now everything was running well, but then suddenly, since a couple of days ago, the python code sometimes starts throwing the following error:
[ERROR] InterfaceError: 2003: Can't connect to MySQL server on 'CLUSTER-DOMAIN:3306' (-3 Temporary failure in name resolution)
This happens in 1 every 1000 or so new connections. What is interesting that I haven't touched this whole service in the last couple of days (since it started happening). All lambdas are using the official MySQL-connector client and connect on every initialization with the following snippet:
import mysql.connector as mysql
import os
connection = mysql.connect(user=os.environ['DATABASE_USER'],
password=os.environ['DATABASE_PASSWORD'],
database=os.environ['DATABASE_NAME'],
host=os.environ['DATABASE_HOST'],
autocommit=True)
To rule out that this is a problem in the Python MySQL client I added the following to resolve the host:
import os
import socket
host = socket.gethostbyname(os.environ['DATABASE_HOST'])
Also here I sometimes get the following error:
[ERROR] gaierror: [Errno -2] Name or service not known
Now I suspect this has something to do with DNS, but since I'm just using the cluster endpoint there is not much I can do about that. What is interesting is that I also recently encountered exactly the same problem in a different region, with the same setup (Aurora 5.6 cluster, lambda's in python connecting to it) and the same happens there.
I've tried restarting all the machines in the cluster, but the problem still seems to occur. Is this really a DNS issue? What can do I to stop this from happening?
AWS Support have told me that this error is likely to be caused by a traffic quota in AWS's VPCs.
According to their documentation on DNS Quotas:
Each Amazon EC2 instance limits the number of packets that can be sent to the Amazon-provided DNS server to a maximum of 1024 packets per second per network interface. This quota cannot be increased. The number of DNS queries per second supported by the Amazon-provided DNS server varies by the type of query, the size of response, and the protocol in use. For more information and recommendations for a scalable DNS architecture, see the Hybrid Cloud DNS Solutions for Amazon VPC whitepaper.
It's important to note that the metric we're looking at here is packets per second, per ENI. What's important about this? Well, it may not be immediately obvious that although the actual number of packets per query varies, there are typically multiple packets per DNS query.
While these packets cannot be seen in VPC flow logs, upon reviewing my own packet captures, I can see some resolutions consisting of about 4 packets.
Unfortunately, I can't say much about the whitepaper; at this stage, I'm not really considering the implementation of a hybrid DNS service as a "good" solution.
I'm looking into ways to alleviate the risk of this error occurring, and to limit it's impacts when it does occur. As I see it, there are number of options to achieve this:
INFORMATION_SCHEMA.REPLICA_HOST_STATUS
table, in which MySQL exposes " in near-real-time" metadata about DB instances. Note that the table "contains cluster-wide metadata". If you cbf, have a look at option 4.is a database driver or connector with the ability to read DB cluster topology from the metadata table. It can route new connections to individual instance endpoints without relying on high-level cluster endpoints. A smart driver is also typically capable of load balancing read-only connections across the available Aurora Replicas in a round-robin fashion.
Initially, I thought it might be a good idea to create a CNAME which points to the cluster, but now I'm not so sure that caching Aurora DNS query results is wise. There are a few reasons for this, which are discussed in varying levels of details in The Aurora Connection Management Handbook:
Unless you use a smart database driver, you depend on DNS record updates and DNS propagation for failovers, instance scaling, and load balancing across Aurora Replicas. Currently, Aurora DNS zones use a short Time-To-Live (TTL) of 5 seconds. Ensure that your network and client configurations don’t further increase the DNS cache TTL
Aurora's cluster and reader endpoints abstract the role changes (primary instance promotion/demotion) and topology changes (addition and removal of instances) occurring in the DB cluster
I hope this helps!
I had the same error with an instance (and ruled out the DNS lookup limit). After some time I stumbled on an AWS support thread indicating that it could be a hardware problem.
The physical underlying host of your instance (i-3d124c6d) looks to have intermittently been having a issues, some of which would have definitely caused service interruption.
Could you try stopping and starting this instance? Doing so will cause it to be brought up on new underlying hardware and then we could utilize your pingdom service to verify if further issues arise.
from: https://forums.aws.amazon.com/thread.jspa?threadID=171805.
Stopping and restarting the instance resolved the issue for me.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With