A steady number of HBase requests are taking almost exactly 5000ms to complete (successfully) despite lower timeouts. No idea why

My client is using org.apache.hbase:hbase-client:2.1.0 and the server is running 1.2.0-cdh5.11.1 (1.2.0-cdh5.11.0 in an alternative test cluster).

My client is very simple: it instantiates a Connection at startup (as recommended by Apache, this object is shared across threads since it is heavyweight and thread safe). Then, for each request, it creates a Table and calls .exists(new Get(rowKey)).

So like this:

Connection conn = ConnectionFactory.createConnection(hbaseConfig);

and

final Table table = conn.getTable(tableName);
return table.exists(new Get(context.getKey()));
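
Put together, the per-request path looks roughly like the sketch below. HBaseChecker is just a hypothetical wrapper used for illustration; hbaseConfig and the table name come from the snippets above. The only detail added is that the Table is closed after use, which is cheap since it does not own the underlying connection:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;

class HBaseChecker {
    // Heavy and thread-safe; created once at startup and shared across threads.
    private final Connection conn;
    private final TableName tableName;

    HBaseChecker(Configuration hbaseConfig, TableName tableName) throws IOException {
        this.conn = ConnectionFactory.createConnection(hbaseConfig);
        this.tableName = tableName;
    }

    // Called per request; Table is lightweight, so create and close it each time.
    boolean exists(byte[] rowKey) throws IOException {
        try (Table table = conn.getTable(tableName)) {
            return table.exists(new Get(rowKey));
        }
    }
}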

Most of the time the request latency to HBase and back is 40ms at worst. Usually it completes in around 10ms.

However, we're noticing occasional requests take around 5000ms (5s) - but still complete successfully!

And by occasional I mean around 1 request per minute (out of 600 per minute total, so a small rate). But it's steady.

These land almost exactly at 5s (+/- 100-200ms). That's the odd part. It's not a spike.

At first I suspected it was a misconfiguration of the client and I needed to set stricter timeouts so I set the following:

hbaseConfig.set(HConstants.HBASE_CLIENT_RETRIES_NUMBER, "1");
hbaseConfig.set(HConstants.HBASE_CLIENT_PAUSE, "50");
hbaseConfig.set(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT, "2000");
hbaseConfig.set(HConstants.HBASE_RPC_TIMEOUT_KEY, "1500");
hbaseConfig.set(HConstants.HBASE_RPC_SHORTOPERATION_TIMEOUT_KEY, "2000");
hbaseConfig.set(HConstants.HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD, "1500");
hbaseConfig.set(HConstants.ZOOKEEPER_RECOVERABLE_WAITTIME, "2000");
hbaseConfig.set(HConstants.ZK_SESSION_TIMEOUT, "2000");
hbaseConfig.set("zookeeper.recovery.retry", "1");
hbaseConfig.set("zookeeper.recovery.retry.intervalmill","200");
hbaseConfig.set("hbase.ipc.client.socket.timeout.connect", "2000");
hbaseConfig.set("hbase.ipc.client.socket.timeout.read", "2000");
hbaseConfig.set("hbase.ipc.client.socket.timeout.write", "2000");

In other words, 5000ms is way over the global timeout (as set in HConstants.HBASE_CLIENT_OPERATION_TIMEOUT).

Yet I have requests that take ~5s to complete, and do so successfully.

In addition to these timeouts, I switched from AsyncConnection to Connection (I didn't need it to be async anyway) and am considering plain Get calls instead of exists.
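
For reference, a Get-based existence check (a sketch only, assuming the same conn, tableName and context objects as above) would look like this:

// Sketch: fetch the row and test the Result instead of calling exists().
try (Table table = conn.getTable(tableName)) {
    Result result = table.get(new Get(context.getKey()));
    return !result.isEmpty();
}

Result.isEmpty() is true when no cells came back for the row, so the return value matches what exists() reports.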

But at this point I'm stumped. I can't find any property that would explain where the 5s is coming from. It's not even a timeout; the request actually succeeds!

Has anyone encountered this before? Is there any way to get hbase-client to emit metrics? Server-side metrics show no increase in latency (scan metrics).
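
(On the metrics question: hbase-client 2.x ships a client-side metrics switch. A minimal sketch, assuming the "hbase.client.metrics.enable" property is honoured by 2.1.0; the per-connection RPC metrics are then published over JMX:)

// Assumption: "hbase.client.metrics.enable" (MetricsConnection.CLIENT_SIDE_METRICS_ENABLED_KEY)
// enables per-connection RPC metrics, exposed via JMX.
hbaseConfig.setBoolean("hbase.client.metrics.enable", true);
Connection conn = ConnectionFactory.createConnection(hbaseConfig);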

asked Feb 12 '19 by encee

1 Answer

For us, the thread dumps showed that the time was spent performing DNS resolution via the IPv6 implementation, and we also found a few reported issues describing a 5-second latency in the native lookup.

    "hconnection-0x6fc1d215-shared--pool1-t95"
   java.lang.Thread.State: RUNNABLE
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
        at java.net.InetAddress.getAllByName(InetAddress.java:1193)
        at java.net.InetAddress.getAllByName(InetAddress.java:1127)
        at java.net.InetAddress.getByName(InetAddress.java:1077)
        at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getStubKey(ConnectionManager.java:1802)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getClient(ConnectionManager.java:1772)
        at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:163)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:409)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:134)
        at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

https://github.com/netty/netty/issues/6454

Adding -Djava.net.preferIPv4Stack=true to force the IPv4 stack seems to have fixed this issue for us. We are not seeing the 5-second latency anymore (fingers crossed).
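
A minimal sketch of where the flag goes (the jar name is hypothetical). Setting the property programmatically only takes effect if it runs before any networking classes are initialized, so the command-line flag is the safer option:

// Usual form, passed to the JVM at startup:
//   java -Djava.net.preferIPv4Stack=true -jar my-hbase-client-app.jar
public class Launcher {
    public static void main(String[] args) {
        // Must run before any networking classes are loaded to have any effect.
        System.setProperty("java.net.preferIPv4Stack", "true");
        // ... bootstrap the HBase client application from here
    }
}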

answered Nov 13 '22 by ah.narayanan