My client is using org.apache.hbase:hbase-client:2.1.0 and the server is running 1.2.0-cdh5.11.1 (1.2.0-cdh5.11.0 in an alternative test cluster).
My client is very simple: it instantiates a Connection instance at startup (as recommended by Apache, this object is shared across threads since it's heavy and thread-safe). Then, for each request, it creates a Table instance and performs an .exists(new Get(rowKey)) call.
So like this:
// Created once at startup and shared across threads:
Connection conn = ConnectionFactory.createConnection(hbaseConfig);

// Then, for each request:
final Table table = conn.getTable(tableName);
return table.exists(new Get(context.getKey()));
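Put together, the whole path looks roughly like this (a sketch, not my exact code - the class name and the IOException handling are illustrative; the Table is created per request and closed, since it's lightweight and not thread-safe):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;

public class RowExistsChecker {

    // Heavy, thread-safe, created once at startup and shared.
    private final Connection conn;

    public RowExistsChecker(Configuration hbaseConfig) throws IOException {
        this.conn = ConnectionFactory.createConnection(hbaseConfig);
    }

    // Called per request, from many threads.
    public boolean exists(TableName tableName, byte[] rowKey) throws IOException {
        // Table is lightweight and not thread-safe: create it per request and close it.
        try (Table table = conn.getTable(tableName)) {
            return table.exists(new Get(rowKey));
        }
    }
}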
Most of the time the request latency to HBase and back is 40ms at worst; usually it completes in around 10ms.
However, we're noticing that occasional requests take around 5000ms (5s) - but still complete successfully!
By occasional I mean around 1 request per minute (out of 600 per minute total, so a small rate), but it's steady.
These slow requests take almost exactly 5s (+/- 100-200ms). That's the odd part: it's not a random spike.
At first I suspected a client misconfiguration and that I needed to set stricter timeouts, so I set the following:
hbaseConfig.set(HConstants.HBASE_CLIENT_RETRIES_NUMBER, "1");
hbaseConfig.set(HConstants.HBASE_CLIENT_PAUSE, "50");
hbaseConfig.set(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT, "2000");
hbaseConfig.set(HConstants.HBASE_RPC_TIMEOUT_KEY, "1500");
hbaseConfig.set(HConstants.HBASE_RPC_SHORTOPERATION_TIMEOUT_KEY, "2000");
hbaseConfig.set(HConstants.HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD, "1500");
hbaseConfig.set(HConstants.ZOOKEEPER_RECOVERABLE_WAITTIME, "2000");
hbaseConfig.set(HConstants.ZK_SESSION_TIMEOUT, "2000");
hbaseConfig.set("zookeeper.recovery.retry", "1");
hbaseConfig.set("zookeeper.recovery.retry.intervalmill","200");
hbaseConfig.set("hbase.ipc.client.socket.timeout.connect", "2000");
hbaseConfig.set("hbase.ipc.client.socket.timeout.read", "2000");
hbaseConfig.set("hbase.ipc.client.socket.timeout.write", "2000");
In other words, 5000ms is way over the global timeout (as set via HConstants.HBASE_CLIENT_OPERATION_TIMEOUT). Yet I have requests that take ~5s to complete - and do so successfully.
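To at least see these outliers from the client's point of view, the call can be wrapped with a simple timer - a minimal sketch (the helper name and the 1000ms threshold are just illustrative):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;

public final class TimedExists {

    // Same exists() call as above, but measures wall-clock time and flags slow requests.
    public static boolean timedExists(Connection conn, TableName tableName, byte[] rowKey)
            throws IOException {
        long start = System.nanoTime();
        try (Table table = conn.getTable(tableName)) {
            return table.exists(new Get(rowKey));
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > 1000) { // illustrative threshold for "suspiciously slow"
                System.err.println("slow exists(): " + elapsedMs + " ms");
            }
        }
    }
}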
In addition to these timeouts, I switched from using AsyncConnection to Connection (I didn't need it to be async anyway), and I'm thinking about just making get() calls instead of exists().
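For reference, the get()-based variant would look roughly like this (a sketch, same variables as in the snippet above; an empty Result means the row doesn't exist):

try (Table table = conn.getTable(tableName)) {
    // Fetch the row instead of calling exists(); an empty Result means it is absent.
    Result result = table.get(new Get(context.getKey()));
    return !result.isEmpty();
}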
But at this point I'm stumped. I can't find any property that would explain where the 5s is coming from. It's not even a timeout - the requests actually succeed!
Has anyone encountered this before? Is there any way to get hbase-client to emit metrics? Server-side metrics show no increase in latency (scan metrics).
For us, thread dumps showed that the time was spent performing DNS resolution via the IPv6 implementation, and we also found a few issues describing a 5-second latency in the native lookup library:
"hconnection-0x6fc1d215-shared--pool1-t95"
java.lang.Thread.State: RUNNABLE
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getByName(InetAddress.java:1077)
at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getStubKey(ConnectionManager.java:1802)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getClient(ConnectionManager.java:1772)
at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:163)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:409)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:134)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
https://github.com/netty/netty/issues/6454
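A quick way to check whether name resolution itself is the problem is to time a plain InetAddress lookup for a region server hostname from the client host - a minimal sketch (the default hostname below is just a placeholder):

import java.net.InetAddress;

public final class DnsLatencyCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder hostname - pass one of your actual region server hosts as an argument.
        String host = args.length > 0 ? args[0] : "regionserver-1.example.com";
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            InetAddress addr = InetAddress.getByName(host);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(addr.getHostAddress() + " resolved in " + elapsedMs + " ms");
        }
    }
}

Note that the JVM caches successful lookups (networkaddress.cache.ttl), so usually only the first iteration reflects the real resolver latency.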
Adding -Djava.net.preferIPv4Stack=true to force the IPv4 stack seems to have fixed this issue for us. We are not seeing the 5-second latency anymore (fingers crossed).