I am using the Curator framework to connect to a ZooKeeper server, but I am running into a weird DNS resolution issue. Here is the jstack dump for the thread:
#21 prio=5 os_prio=0 tid=0x0000000001888800 nid=0x3a46 runnable [0x00007f25e69f3000]
java.lang.Thread.State: RUNNABLE
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at org.apache.zookeeper.client.StaticHostProvider.resolveAndShuffle(StaticHostProvider.java:117)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:81)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1096)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1006)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:804)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:679)
at com.netflix.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:72)
- locked <0x00000000fd761f40> (a com.netflix.curator.HandleHolder$1)
at com.netflix.curator.HandleHolder.getZooKeeper(HandleHolder.java:46)
at com.netflix.curator.ConnectionState.reset(ConnectionState.java:122)
at com.netflix.curator.ConnectionState.start(ConnectionState.java:95)
at com.netflix.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:137)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:167)
The thread seems to be stuck in the native method and never returns. It also occurs very randomly, so I haven't been able to reproduce it consistently. Any ideas?
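For context, the client is created and started roughly like this (a sketch using the Netflix Curator 1.x API that appears in the stack trace; the connect string, timeouts, and retry policy are placeholders, not my real values):

import com.netflix.curator.framework.CuratorFramework;
import com.netflix.curator.framework.CuratorFrameworkFactory;
import com.netflix.curator.retry.ExponentialBackoffRetry;

public class ZkClientStart {
    public static void main(String[] args) {
        // Placeholder connect string; the real one lists the ZooKeeper ensemble.
        String connectString = "zk1.example.com:2181,zk2.example.com:2181";
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString,
                15000,  // session timeout (ms)
                5000,   // connection timeout (ms)
                new ExponentialBackoffRetry(1000, 3));
        // start() builds the underlying ZooKeeper handle, which resolves every
        // host in the connect string synchronously -- the lookupAllHostAddr
        // frame in the stack above.
        client.start();
    }
}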
We are also trying to solve this problem. It looks like this is due to a glibc bug: https://bugzilla.kernel.org/show_bug.cgi?id=99671 or a kernel bug: https://bugzilla.redhat.com/show_bug.cgi?id=1209433, depending on who you ask ;)
Also worth reading: https://access.redhat.com/security/cve/cve-2013-7423 and https://alas.aws.amazon.com/ALAS-2015-617.html
To confirm that this is indeed the case, attach gdb to the Java process:
gdb --pid <JavaProcessPid>
then from gdb:
info threads
find a thread that is blocked in recvmsg and switch to it:
thread <HangingThreadId>
and then
backtrace
and if you see something like this, then you know that a glibc/kernel upgrade will help:
#0 0x00007fc726ff27cd in recvmsg () from /lib64/libc.so.6
#1 0x00007fc727018765 in make_request () from /lib64/libc.so.6
#2 0x00007fc727018b9a in __check_pf () from /lib64/libc.so.6
#3 0x00007fc726fdbd57 in getaddrinfo () from /lib64/libc.so.6
#4 0x00007fc706dd9635 in Java_java_net_Inet6AddressImpl_lookupAllHostAddr () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-0.b17.el6_7.x86_64/jre/lib/amd64/libnet.so
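Independently of gdb, a small standalone Java check (the hostname below is a placeholder) can exercise the same InetAddress.getAllByName path with a watchdog, to see whether the lookup itself is what hangs:

import java.net.InetAddress;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LookupCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "zk1.example.com"; // placeholder
        // Daemon thread so a lookup stuck in native code does not keep the JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "lookup");
            t.setDaemon(true);
            return t;
        });
        Future<InetAddress[]> f = pool.submit(() -> InetAddress.getAllByName(host));
        try {
            // getAllByName has no timeout of its own, so enforce one from outside.
            for (InetAddress a : f.get(10, TimeUnit.SECONDS)) {
                System.out.println(a.getHostAddress());
            }
        } catch (TimeoutException e) {
            System.err.println("lookup of " + host + " hangs; suspect the check_pf issue");
        }
    }
}

This only tells you whether the resolution call is hanging; the gdb backtrace above is what points the finger at __check_pf in glibc.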
Update: Looks like the kernel wins. Please see this thread for details: http://www.gossamer-threads.com/lists/linux/kernel/2264958. Also, there is a simple program you can use to verify that your system is affected by the kernel bug: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473
To verify:
curl -o pf_dump.c https://gist.githubusercontent.com/stevenschlansker/6ad46c5ccb22bc4f3473/raw/22cfe72f6708de1e3468c1e0fa3888aafae42db4/pf_dump.c
gcc pf_dump.c -pthread -o pf_dump
./pf_dump
And if the output is:
[26170] glibc: check_pf: netlink socket read timeout
Aborted
Then the system is affected. If the output is something like:
exit success
[7618] exit success
[7265] exit success
then the system is OK. In the AWS context, upgrading AMIs to 2016.3.2, which has the new kernel, seems to have fixed the problem.