
Intermittently can't connect to MySQL on AWS RDS (Error 2003)

We are having an intermittent issue with connections to our MySQL server timing out. The error we receive is as follows.

    (2003, 'Can\'t connect to MySQL server on \'<connection>\' ((2013, "Lost connection to MySQL server during query (error(104, \'Connection reset by peer\'))"))')

    Callstack:
    File "/usr/lib64/python2.7/site-packages/pymysql/connections.py", line 818, in _connect
      2003, "Can't connect to MySQL server on %r (%s)" % (self.host, e))
    File "/usr/lib64/python2.7/site-packages/pymysql/connections.py", line 626, in __init__
      self._connect()

Some more info:

  • We have a fleet of EC2 servers that constantly run queries against a backend RDS instance.
  • We average about 500 connections per second to the RDS instance.
  • We see around 0 to 4 hiccups per RDS instance per day.
  • The hiccups don't correspond with our maintenance window.
  • A single hiccup can affect quite a few connections (around 50).
  • When a hiccup happens, it disrupts connections across all servers and ports.

The error itself appears to be generated when the TCP connection is closed on the EC2 side. Our TCP keepalive time is set to 7200 seconds, and that is when the error fires.
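For anyone checking the same settings, here is a minimal sketch of reading the kernel's keepalive values on a Linux client and estimating how long a silent, idle connection can linger before the kernel gives up on it. It assumes a Linux host exposing `/proc/sys/net/ipv4`; the helper names are made up for illustration, and the fallback values are the stock Linux defaults.

```python
import os

# Stock Linux defaults, used as a fallback when /proc is not readable
# (for example on a non-Linux host). These are assumptions, not values
# taken from the servers described in the question.
DEFAULTS = {
    "tcp_keepalive_time": 7200,   # idle seconds before the first probe
    "tcp_keepalive_intvl": 75,    # seconds between probes
    "tcp_keepalive_probes": 9,    # unanswered probes before giving up
}


def read_keepalive_setting(name, root="/proc/sys/net/ipv4"):
    """Return a keepalive sysctl as an int, or the default if unreadable."""
    path = os.path.join(root, name)
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return DEFAULTS[name]


def dead_peer_detection_seconds(idle, interval, probes):
    """Worst-case time before an idle, unresponsive peer is declared dead."""
    return idle + interval * probes


def current_detection_seconds():
    """Combine the three settings read from this host."""
    return dead_peer_detection_seconds(
        read_keepalive_setting("tcp_keepalive_time"),
        read_keepalive_setting("tcp_keepalive_intvl"),
        read_keepalive_setting("tcp_keepalive_probes"),
    )
```

With the defaults, `dead_peer_detection_seconds(7200, 75, 9)` is 7875 seconds, so a keepalive time of 7200 seconds means a dead connection can go unnoticed for over two hours.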

My question is: what can be done to track down why these hiccups happen? It's great that they don't happen often, but ideally they wouldn't happen at all.

Any advice would be appreciated, thanks!
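One way to gather data on the hiccups while riding them out is to wrap the connect call in a small retry helper that logs each failure with its duration. The sketch below is only an illustration: `connect_with_retry` and its parameters are hypothetical names, and it takes any zero-argument callable so it does not depend on a particular driver.

```python
import logging
import time

log = logging.getLogger("db_connect")


def connect_with_retry(connect, attempts=3, delay=0.5,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call connect() up to `attempts` times, logging every failure.

    connect  -- zero-argument callable that returns a connection
    retry_on -- tuple of exception types that should trigger a retry
    sleep    -- injectable so the helper can be tested without waiting
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        started = time.time()
        try:
            return connect()
        except retry_on as exc:
            last_error = exc
            log.warning("connect attempt %d/%d failed after %.3fs: %r",
                        attempt, attempts, time.time() - started, exc)
            if attempt < attempts:
                sleep(delay * attempt)  # linear backoff between attempts
    raise last_error


# Hypothetical usage with pymysql (host and credentials are placeholders):
#   import pymysql
#   conn = connect_with_retry(
#       lambda: pymysql.connect(host="<connection>", user="...",
#                               password="...", connect_timeout=5),
#       retry_on=(pymysql.err.OperationalError,))
```

The logged timestamps and durations can then be lined up against RDS metrics to see whether the failures cluster around load spikes.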

Update 10/29:

I've been running a service that checks for long-running processes on the MySQL server, and it looks like these failed connections never get that far: a new process is never created for them. I'm still seeing the hiccups, but with no sign of the connections on the server side.

Zach asked Oct 24 '14 20:10


1 Answer

After some back and forth with Amazon support, here is the solution we arrived at.

Amazon raised our socket listen backlog by adjusting the somaxconn value on the RDS instance.

The value was at the default of 128 and has been raised to 1024.

Once the value was adjusted, we no longer received the Lost Connection error.
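As a rough illustration of why the backlog size could matter at this connection rate, here is a back-of-the-envelope sketch. It assumes the simplest possible model (the server stops accepting entirely for a moment while clients keep arriving at a steady rate), and the function names are made up for this example. On Linux the backlog requested by a server is capped at net.core.somaxconn, which is why raising that value was needed.

```python
def effective_backlog(requested, somaxconn):
    """Linux caps the backlog passed to listen() at net.core.somaxconn."""
    return min(requested, somaxconn)


def seconds_until_queue_full(backlog, connections_per_second):
    """How long an accept stall can last before the pending queue fills.

    Simplified model: no connections are accepted during the stall and
    new ones arrive at a constant rate.
    """
    return backlog / float(connections_per_second)


# Figures from the question and answer: about 500 connections per second,
# somaxconn raised from 128 to 1024.
before = seconds_until_queue_full(128, 500)    # 0.256 seconds
after = seconds_until_queue_full(1024, 500)    # 2.048 seconds
```

Under this model, a backlog of 128 tolerates only about a quarter of a second of stalled accepts at 500 connections per second, after which new connection attempts may be dropped or reset; 1024 gives roughly two seconds of headroom.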

Zach answered Sep 22 '22 07:09