Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Flyway regularly hangs (MariaDB connector, RDS)

I've been seeing frequent hangs on deployment, at the migration step. Java/Scala application packaged in WAR for Tomcat. Database is RDS Aurora using MariaDB connector (https://downloads.mariadb.org/connector-java/).

Probably has nothing to do with Flyway but is a generic problem getting a connection.

Migration is run from shell in container:

java -cp `echo WEB-INF/lib/*|tr ' ' :` foo.Migrate

Migration code looks like:

def main(args: Array[String]): Unit = {
  Environment.dbFlywayPassword.foreach { pass =>
    val flyway = new Flyway
    flyway.setDataSource(Environment.jdbcUrl, "flyway", pass)
    flyway.migrate
  }
}

Connection string:

jdbc:mysql:aurora://%RDS_HOST%/xxx?serverSslCert=/rds-ca-2015-root.pem&useSSL=true&connectTimeout=10000

I've tried increasing logging level in Flyway, but nothing is logged after this line:

15:57:35.115 [main] INFO  o.f.c.internal.util.VersionPrinter - Flyway 4.2.0 by Boxfuse

So I got a thread dump, that looks like this:

15:57:35.115 [main] INFO  o.f.c.internal.util.VersionPrinter - Flyway 4.2.0 by Boxfuse
2017-06-08 15:57:56
Full thread dump OpenJDK 64-Bit Server VM (25.121-b13 mixed mode):

"MariaDb-failover-1" #8 daemon prio=5 os_prio=0 tid=0x00005555f80ae000 nid=0x14 waiting on condition [0x00007fc330b8f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000f5c59b10> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

"Service Thread" #7 daemon prio=9 os_prio=0 tid=0x00005555f70bf000 nid=0x12 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00005555f7063000 nid=0x11 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00005555f7060800 nid=0x10 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00005555f705e800 nid=0xf waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00005555f702f000 nid=0xe in Object.wait() [0x00007fc331616000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000f5a30c58> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
    - locked <0x00000000f5a30c58> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00005555f702c000 nid=0xd in Object.wait() [0x00007fc331717000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x00000000f5a30e10> (a java.lang.ref.Reference$Lock)
    at java.lang.Object.wait(Object.java:502)
    at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
    - locked <0x00000000f5a30e10> (a java.lang.ref.Reference$Lock)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

"main" #1 prio=5 os_prio=0 tid=0x00005555f6f85000 nid=0xb runnable [0x00007fc34341d000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
    - locked <0x00000000f0a66090> (a java.lang.Object)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    - locked <0x00000000f0a81eb0> (a sun.security.ssl.AppInputStream)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    - locked <0x00000000f06b5118> (a java.io.BufferedInputStream)
    at org.mariadb.jdbc.internal.io.input.StandardPacketInputStream.getPacketArray(StandardPacketInputStream.java:125)
    at org.mariadb.jdbc.internal.io.input.StandardPacketInputStream.getPacket(StandardPacketInputStream.java:95)
    at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.readPacket(AbstractQueryProtocol.java:1002)
    at org.mariadb.jdbc.internal.protocol.AbstractQueryProtocol.getResult(AbstractQueryProtocol.java:982)
    at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.readRequestSessionVariables(AbstractConnectProtocol.java:498)
    at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.readPipelineAdditionalData(AbstractConnectProtocol.java:544)
    at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:410)
    at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:357)
    at org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:149)
    at org.mariadb.jdbc.internal.failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:179)
    at org.mariadb.jdbc.internal.failover.impl.MastersSlavesListener.initializeConnection(MastersSlavesListener.java:154)
    at org.mariadb.jdbc.internal.failover.FailoverProxy.<init>(FailoverProxy.java:94)
    at org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:464)
    at org.mariadb.jdbc.Driver.connect(Driver.java:103)
    at org.flywaydb.core.internal.util.jdbc.DriverDataSource.getConnectionFromDriver(DriverDataSource.java:416)
    at org.flywaydb.core.internal.util.jdbc.DriverDataSource.getConnection(DriverDataSource.java:381)
    at org.flywaydb.core.internal.util.jdbc.JdbcUtils.openConnection(JdbcUtils.java:51)
    at org.flywaydb.core.Flyway.execute(Flyway.java:1418)
    at org.flywaydb.core.Flyway.migrate(Flyway.java:971)
    at tgam.service.data.Migrate$.$anonfun$main$1(Migrate.scala:11)
    at foo.Migrate$.$anonfun$main$1$adapted(Migrate.scala:8)
    at foo.Migrate$$$Lambda$4/458209687.apply(Unknown Source)
    at scala.Option.foreach(Option.scala:257)
    at foo.Migrate$.main(Migrate.scala:8)
    at foo.Migrate.main(Migrate.scala)

"VM Thread" os_prio=0 tid=0x00005555f7022000 nid=0xc runnable

"VM Periodic Task Thread" os_prio=0 tid=0x00005555f70da800 nid=0x13 waiting on condition

JNI global references: 232

Heap
 def new generation   total 4928K, used 1611K [0x00000000f0600000, 0x00000000f0b50000, 0x00000000f5950000)
  eden space 4416K,  24% used [0x00000000f0600000, 0x00000000f0712c10, 0x00000000f0a50000)
  from space 512K, 100% used [0x00000000f0a50000, 0x00000000f0ad0000, 0x00000000f0ad0000)
  to   space 512K,   0% used [0x00000000f0ad0000, 0x00000000f0ad0000, 0x00000000f0b50000)
 tenured generation   total 10944K, used 4187K [0x00000000f5950000, 0x00000000f6400000, 0x0000000100000000)
   the space 10944K,  38% used [0x00000000f5950000, 0x00000000f5d66e78, 0x00000000f5d67000, 0x00000000f6400000)
 Metaspace       used 13061K, capacity 13306K, committed 13568K, reserved 1060864K
  class space    used 1381K, capacity 1449K, committed 1536K, reserved 1048576K

Looks like an I/O hang in org.mariadb.jdbc.Driver.connect, but I do have a connectTimeout set (10 seconds). This timeout doesn't seem to be effective (would I need a socketTimeout per https://github.com/brettwooldridge/HikariCP/issues/754 ?)

This has been happening for a while. The same thing happened when I was using Tomcat's contextInitialized hook to do migrations. I decided to refactor out into a separate invocation before starting Tomcat, which looks like a better idea in general, but it hasn't affected this behaviour.

What will typically happen is that the code will hang, after 2-3 minutes ECS will timeout, and trigger a redeploy. After some number of these retries (e.g. up to 10), Flyway will run successfully and the service will start.

like image 869
qu1j0t3 Avatar asked Dec 12 '25 04:12

qu1j0t3


1 Answers

OK, seems this is a known bug in RDS Aurora - and it's alluded to in the MariaDB Connector doc (surely it should be a runtime warning, though!)

https://mariadb.com/kb/en/mariadb/about-mariadb-connector-j/#infrequently-used

usePipelineAuth Not compatible with aurora During connection, different queries are executed. When option is active those queries are send using pipeline (all queries are send, then only all results are reads), permitting faster connection creation. Default: true. Since 1.6.0

Also kudos to wlad_ in Freenode #maria who pointed me in the right direction.

like image 149
qu1j0t3 Avatar answered Dec 13 '25 18:12

qu1j0t3