Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Configure GlassFish JDBC connection pool to handle Amazon RDS Multi-AZ failover

I have a Java EE application running in GlassFish on EC2, with a MySQL database on Amazon RDS. I am trying to configure the JDBC connection pool to in order to minimize downtime in case of database failover.

My current configuration isn't working correctly during a Multi-AZ failover, as the standby database instance appears to be available in a couple of minutes (according to the AWS console) while my GlassFish instance remains stuck for a long time (about 15 minutes) before resuming work.

The connection pool is configured like this:

asadmin create-jdbc-connection-pool --restype javax.sql.ConnectionPoolDataSource \
--datasourceclassname com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource \
--isconnectvalidatereq=true --validateatmostonceperiod=60 --validationmethod=auto-commit \
--property user=$DBUSER:password=$DBPASS:databaseName=$DBNAME:serverName=$DBHOST:port=$DBPORT \
MyPool

If I use a Single-AZ db.m1.small instance and reboot the database from the console, GlassFish will invalidate the broken connections, throw some exceptions and then reconnect as soon the database is available. In this setup I get less than 1 minute of downtime.

If I use a Multi-AZ db.m1.small instance and reboot with failover from the AWS console, I see no exception at all. The server halts completely, with all incoming requests timing out. After 15 minutes I finally get this:

Communication failure detected when attempting to perform read query outside of a transaction. Attempting to retry query. Error was: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.3.2.v20111125-r10461): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 940,715 milliseconds ago.  The last packet sent successfully to the server was 935,598 milliseconds ago.

It appears as if each HTTP thread gets blocked on an invalid connection without getting an exception and so there's no chance to perform connection validation.

Downtime in the Multi-AZ case is always between 15-16 minutes, so it looks like a timeout of some sort but I was unable to change it.

Things I have tried without success:

  • connection leak timeout/reclaim
  • statement leak timeout/reclaim
  • statement timeout
  • using a different validation method
  • using MysqlDataSource instead of MysqlConnectionPoolDataSource

How can I set a timeout on stuck queries so that connections in the pool are reused, validated and replaced? Or how can I let GlassFish detect a database failover?

like image 452
Andrea Avatar asked Jul 25 '14 14:07

Andrea


1 Answers

As I commented before, it is because the sockets that are open and connected to the database don't realize the connection has been lost, so they stayed connected until the OS socket timeout is triggered, which I read might be usually in about 30 minutes.

To solve the issue you need to override the socket Timeout in your JDBC Connection String or in the JDNI COnnection Configuration/Properties to define the socketTimeout param to a smaller time.

Keep in mind that any connection longer than the value defined will be killed, even if it is being used (I haven't been able to confirm this, is what I read).

The other two parameters I mention in my comment are connectTimeout and autoReconnect.

Here's my JDBC Connection String:

jdbc:(...)&connectTimeout=15000&socketTimeout=60000&autoReconnect=true 

I also disabled Java's DNS cache by doing

 java.security.Security.setProperty("networkaddress.cache.ttl" , "0"); 
 java.security.Security.setProperty("networkaddress.cache.negative.ttl" , "0"); 

I do this because Java doesn't honor the TTL's, and when the failover takes place, the DNS is the same but the IP changes.

Since you are using an Application Server, the parameters to disable DNS cache must be passed to the JVM when starting the glassfish with -Dnet and not the application itself.

like image 179
hectorg87 Avatar answered Sep 28 '22 10:09

hectorg87