We have MySQL 5.7 master - slaves replications and on the slave servers side, it hapens from time to time that our application monitoring tools (Tideways and PHP7.0) are reporting
MySQL has gone away.
Checking the MYSQL side:
show global status like '%Connection%';
+-----------------------------------+----------+
| Variable_name | Value |
+-----------------------------------+----------+
| Connection_errors_accept | 0 |
| Connection_errors_internal | 0 |
| Connection_errors_max_connections | 0 |
| Connection_errors_peer_address | 323 |
| Connection_errors_select | 0 |
| Connection_errors_tcpwrap | 0 |
| Connections | 55210496 |
| Max_used_connections | 387 |
| Slave_connections | 0 |
+-----------------------------------+----------+
The Connection_errors_peer_address
shows 323. How to further investigate on what is causing this issue on both sides:
MySQL has gone away
and
Connection_errors_peer_address
EDIT:
Master Server
net_retry_count = 10
net_read_timeout = 120
net_write_timeout = 120
skip_networking = OFF
Aborted_clients = 151650
Slave Server 1
net_retry_count = 10
net_read_timeout = 30
net_write_timeout = 60
skip_networking = OFF
Aborted_clients = 3
Slave Server 2
net_retry_count = 10
net_read_timeout = 30
net_write_timeout = 60
skip_networking = OFF
Aborted_clients = 3
In MySQL 5.7, when a new TCP/IP connection reaches the server, the server performs several checks, implemented in sql/sql_connect.cc
in function check_connection()
One of these checks is to get the IP address of the client side connection, as in:
static int check_connection(THD *thd)
{
...
if (!thd->m_main_security_ctx.host().length) // If TCP/IP connection
{
...
peer_rc= vio_peer_addr(net->vio, ip, &thd->peer_port, NI_MAXHOST);
if (peer_rc)
{
/*
Since we can not even get the peer IP address,
there is nothing to show in the host_cache,
so increment the global status variable for peer address errors.
*/
connection_errors_peer_addr++;
my_error(ER_BAD_HOST_ERROR, MYF(0));
return 1;
}
...
}
Upon failure, the status variable connection_errors_peer_addr
is incremented, and the connection is rejected.
vio_peer_addr()
is implemented in vio/viosocket.c
(code simplified to show only the important calls)
my_bool vio_peer_addr(Vio *vio, char *ip_buffer, uint16 *port,
size_t ip_buffer_size)
{
if (vio->localhost)
{
...
}
else
{
/* Get sockaddr by socked fd. */
err_code= mysql_socket_getpeername(vio->mysql_socket, addr, &addr_length);
if (err_code)
{
DBUG_PRINT("exit", ("getpeername() gave error: %d", socket_errno));
DBUG_RETURN(TRUE);
}
/* Normalize IP address. */
vio_get_normalized_ip(addr, addr_length,
(struct sockaddr *) &vio->remote, &vio->addrLen);
/* Get IP address & port number. */
err_code= vio_getnameinfo((struct sockaddr *) &vio->remote,
ip_buffer, ip_buffer_size,
port_buffer, NI_MAXSERV,
NI_NUMERICHOST | NI_NUMERICSERV);
if (err_code)
{
DBUG_PRINT("exit", ("getnameinfo() gave error: %s",
gai_strerror(err_code)));
DBUG_RETURN(TRUE);
}
...
}
...
}
In short, the only failure path in vio_peer_addr()
happens when a call to mysql_socket_getpeername()
or vio_getnameinfo()
fails.
mysql_socket_getpeername() is just a wrapper on top of getpeername().
The man 2 getpeername
manual lists the following possible errors:
NAME
getpeername - get name of connected peer socket
ERRORS
EBADF The argument sockfd is not a valid descriptor. EFAULT The addr argument points to memory not in a valid part of the process address space. EINVAL addrlen is invalid (e.g., is negative). ENOBUFS Insufficient resources were available in the system to perform the operation. ENOTCONN The socket is not connected. ENOTSOCK The argument sockfd is a file, not a socket.
Of these errors, only ENOBUFS
is plausible.
As for vio_getnameinfo()
, it is just a wrapper on getnameinfo(), which also according to the man page man 3 getnameinfo
can fail for the following reasons:
NAME
getnameinfo - address-to-name translation in protocol-independent manner
RETURN VALUE
EAI_AGAIN The name could not be resolved at this time. Try again later. EAI_BADFLAGS The flags argument has an invalid value. EAI_FAIL A nonrecoverable error occurred. EAI_FAMILY The address family was not recognized, or the address length was invalid for the specified family. EAI_MEMORY Out of memory. EAI_NONAME The name does not resolve for the supplied arguments. NI_NAMEREQD is set and the host's name cannot be located, or neither
hostname nor service name were requested.
EAI_OVERFLOW The buffer pointed to by host or serv was too small. EAI_SYSTEM A system error occurred. The error code can be found in errno. The gai_strerror(3) function translates these error codes to a human readable string, suitable for error reporting.
Here many failures can happen, basically due to heavy load or the network.
To understand the process behind this code, what the MySQL server is essentially doing is a Reverse DNS lookup, to:
Overall, failures accounted with Connection_errors_peer_address
can be due to system load (causing transient failures like out of memory, etc) or due to network issues affecting DNS.
Disclosure: I happen to be the person who implemented this Connection_errors_peer_address
status variable in MySQL, as part of an effort to have better visibility / observability in this area of the code.
[Edit] To follow up with more details and/or guidelines:
Connection_errors_peer_address
is incremented, the root cause is not printed in logs. That is unfortunate for troubleshooting, but also avoid flooding logs causing even more damage, there is a tradeoff here. Keep in mind that anything that happen before logging in is very sensitive ...mysqld
, and monitoring the uptime
, it should be fairly easy to determine if the failure "only" caused connections to be closed with the server staying up, or if the server itself failed catastrophically.getnameinfo
.skip-name-resolve
will have no effect, as this check happens later (see specialflag & SPECIAL_NO_RESOLVE
in the code in check_connection()
)Connection_errors_peer_address
fails, note that the server cleanly returns the error ER_BAD_HOST_ERROR
to the client, and then closes the socket. This is different from just closing abruptly a socket (like in a crash) : the former should be reported by the client as "Can't get hostname for your address"
, while the later is reported as "MySQL has gone away"
.ER_BAD_HOST_ERROR
and a socket closed differently is another storyGiven that this failure overall seems related to DNS lookups, I would check the following items:
performance_schema.host_cache
table.host_cache_size
system variable.Table performance_schema.host_cache
documentation:
https://dev.mysql.com/doc/refman/5.7/en/host-cache-table.html
Further readings:
http://marcalff.blogspot.com/2012/04/performance-schema-nailing-host-cache.html
[Edit 2] Based on the new data available:
The Aborted_clients
status variable shows some connections forcefully closed by the server. This typically happens when a session is idle for a very long time.
A typical scenario for this to happen is:
Note that a client application forgetting to cleanly close sessions will execute 1-3, this could be the case for Aborted_clients on the master. Some cleanup here to fix clients applications using the master would help to decrease resource consumption, as leaving 151650 sessions open to die on timeout has a cost.
A client application executing 1-4 can cause Aborted_clients on the server and MySQL has gone away on the client. The client application reporting "MySQL has gone away" is most likely the culprit here.
If a monitoring application, say, checks the server every N seconds, then make sure the timeouts (here 30 and 60 sec) are significantly greater that N, or the server will kill the monitoring session.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With