My client is using org.apache.hbase:hbase-client:2.1.0 and the server is running 1.2.0-cdh5.11.1 (1.2.0-cdh5.11.0 in an alternative test cluster).
My client is very simple: it instantiates a Connection instance at startup (as recommended by Apache, this object is shared across threads since it's heavy and thread-safe). Then, for each request, it creates a Table instance and performs an .exists(new Get(rowKey)) call.
So like this:
// Created once at startup and shared across threads:
Connection conn = ConnectionFactory.createConnection(hbaseConfig);

// Then, for each request:
final Table table = conn.getTable(tableName);
return table.exists(new Get(context.getKey()));
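Put together, the whole path looks roughly like this (a sketch, not my exact code - the class name and the IOException handling are illustrative; the Table is created per request and closed, since it's lightweight and not thread-safe):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;

public class RowExistsChecker {

    // Heavy, thread-safe, created once at startup and shared.
    private final Connection conn;

    public RowExistsChecker(Configuration hbaseConfig) throws IOException {
        this.conn = ConnectionFactory.createConnection(hbaseConfig);
    }

    // Called per request, from many threads.
    public boolean exists(TableName tableName, byte[] rowKey) throws IOException {
        // Table is lightweight and not thread-safe: create it per request and close it.
        try (Table table = conn.getTable(tableName)) {
            return table.exists(new Get(rowKey));
        }
    }
}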
Most of the time the request latency to HBase and back is 40ms at worst; usually it completes in around 10ms.
However, we're noticing that occasional requests take around 5000ms (5s) - but still complete successfully!
By occasional I mean around 1 request per minute (out of 600 per minute total, so a small rate), but it's steady.
These slow requests take almost exactly 5s (+/- 100-200ms). That's the odd part: it's not a random spike.
At first I suspected a client misconfiguration and that I needed to set stricter timeouts, so I set the following:
hbaseConfig.set(HConstants.HBASE_CLIENT_RETRIES_NUMBER, "1");
hbaseConfig.set(HConstants.HBASE_CLIENT_PAUSE, "50");
hbaseConfig.set(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT, "2000");
hbaseConfig.set(HConstants.HBASE_RPC_TIMEOUT_KEY, "1500");
hbaseConfig.set(HConstants.HBASE_RPC_SHORTOPERATION_TIMEOUT_KEY, "2000");
hbaseConfig.set(HConstants.HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD, "1500");
hbaseConfig.set(HConstants.ZOOKEEPER_RECOVERABLE_WAITTIME, "2000");
hbaseConfig.set(HConstants.ZK_SESSION_TIMEOUT, "2000");
hbaseConfig.set("zookeeper.recovery.retry", "1");
hbaseConfig.set("zookeeper.recovery.retry.intervalmill","200");
hbaseConfig.set("hbase.ipc.client.socket.timeout.connect", "2000");
hbaseConfig.set("hbase.ipc.client.socket.timeout.read", "2000");
hbaseConfig.set("hbase.ipc.client.socket.timeout.write", "2000");
In other words, 5000ms is way over the global timeout (as set via HConstants.HBASE_CLIENT_OPERATION_TIMEOUT). Yet I have requests that take ~5s to complete - and do so successfully.
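To at least see these outliers from the client's point of view, the call can be wrapped with a simple timer - a minimal sketch (the helper name and the 1000ms threshold are just illustrative):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;

public final class TimedExists {

    // Same exists() call as above, but measures wall-clock time and flags slow requests.
    public static boolean timedExists(Connection conn, TableName tableName, byte[] rowKey)
            throws IOException {
        long start = System.nanoTime();
        try (Table table = conn.getTable(tableName)) {
            return table.exists(new Get(rowKey));
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > 1000) { // illustrative threshold for "suspiciously slow"
                System.err.println("slow exists(): " + elapsedMs + " ms");
            }
        }
    }
}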
In addition to these timeouts, I switched from using AsyncConnection to Connection (I didn't need it to be async anyway), and I'm thinking about just making get() calls instead of exists().
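For reference, the get()-based variant would look roughly like this (a sketch, same variables as in the snippet above; an empty Result means the row doesn't exist):

try (Table table = conn.getTable(tableName)) {
    // Fetch the row instead of calling exists(); an empty Result means it is absent.
    Result result = table.get(new Get(context.getKey()));
    return !result.isEmpty();
}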
But at this point I'm stumped. I can't find any property that would explain where the 5s is coming from. It's not even a timeout - the requests actually succeed!
Has anyone encountered this before? Is there any way to get hbase-client to emit metrics? Server-side metrics show no increase in latency (scan metrics).
For us, thread dumps showed that the time was spent performing DNS resolution via the IPv6 implementation, and we also found a few issues describing a 5-second latency in the native lookup library:
"hconnection-0x6fc1d215-shared--pool1-t95"
java.lang.Thread.State: RUNNABLE
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getByName(InetAddress.java:1077)
at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getStubKey(ConnectionManager.java:1802)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getClient(ConnectionManager.java:1772)
at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:163)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:409)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:134)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
https://github.com/netty/netty/issues/6454
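A quick way to check whether name resolution itself is the problem is to time a plain InetAddress lookup for a region server hostname from the client host - a minimal sketch (the default hostname below is just a placeholder):

import java.net.InetAddress;

public final class DnsLatencyCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder hostname - pass one of your actual region server hosts as an argument.
        String host = args.length > 0 ? args[0] : "regionserver-1.example.com";
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            InetAddress addr = InetAddress.getByName(host);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(addr.getHostAddress() + " resolved in " + elapsedMs + " ms");
        }
    }
}

Note that the JVM caches successful lookups (networkaddress.cache.ttl), so usually only the first iteration reflects the real resolver latency.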
Adding -Djava.net.preferIPv4Stack=true to force the IPv4 stack seems to have fixed this issue for us. We are not seeing the 5-second latency anymore (fingers crossed).