Problem
I am trying to build a secondary index with Phoenix. Index creation takes several hours, which seems to be due to slow HBase scans: I observed the same slow scan performance both with the HBase shell and with a Java scanner.
NB: GET operations (by rowkey) perform well (approx. 0.5s).
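For reference, here is a minimal sketch of the Java-side measurement (a sketch only: the table name "mytable" and the rowkey are illustrative, and the 0.98-era client API is assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanVsGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // illustrative table name

        // Full scan : the slow path described above
        long t0 = System.currentTimeMillis();
        ResultScanner scanner = table.getScanner(new Scan());
        long rows = 0;
        for (Result r : scanner) rows++;
        scanner.close();
        System.out.println(rows + " rows scanned in " + (System.currentTimeMillis() - t0) + " ms");

        // GET by rowkey : fast (~0.5s in our case)
        t0 = System.currentTimeMillis();
        table.get(new Get(Bytes.toBytes("some-row-key")));   // illustrative rowkey
        System.out.println("get done in " + (System.currentTimeMillis() - t0) + " ms");

        table.close();
    }
}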
Context
Troubleshooting
Based on the HBase book (http://hbase.apache.org/book.html#performance), here is what I have already checked:
1) Hardware
2) JVM
3) Data
4) HBase configuration
Most configuration remained unchanged.
Some of the parameters that looked OK to me:
I tried the following configuration changes, without any impact on performance:
5) Logs say nothing useful
cat hbase-hbase-master-cox.log | grep "2015-05-11.*ERROR"
cat hbase-hbase-regionserver-*.log | grep "2015-05-11.*ERROR"
print nothing
Printing WARNs shows unrelated errors:
2015-05-11 17:11:10,544 WARN [B.DefaultRpcServer.handler=8,queue=2,port=60020] shortcircuit.ShortCircuitCache: ShortCircuitCache(0x2aca5fca): could not load 1074749724_BP-2077371184-184.10.17.65-1423758745093 due to InvalidToken exception.
2015-05-11 17:09:12,848 WARN [regionserver60020-smallCompactions-1430754386533] hbase.HBaseConfiguration: Config option "hbase.regionserver.lease.period" is deprecated. Instead, use "hbase.client.scanner.timeout.period"
When you compare a partial-key scan with a Get, remember that the rowkey you use for the Get can be a much longer string than the partial key you use for the scan. In that case, for the Get, HBase has to do a deterministic lookup to ascertain the exact location of the rowkey it needs to match, and then fetch it.
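To make the difference concrete, here is a sketch contrasting the two access paths (the rowkey layout "user123|..." and the table name are purely illustrative; the 0.98-era client API is assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GetVsPartialScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");

        // Get : the exact, full rowkey must be supplied
        Result one = table.get(new Get(Bytes.toBytes("user123|2015-05-11|eventA")));

        // Partial-key scan : a key prefix bounded by start/stop rows is enough
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("user123|"));
        scan.setStopRow(Bytes.toBytes("user123}"));   // '}' sorts just after '|'
        ResultScanner rs = table.getScanner(scan);
        for (Result r : rs) { /* process row */ }
        rs.close();
        table.close();
    }
}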
Scanning using the HBase shell: the scan command is used to view the data in an HTable. Its syntax is as follows:
scan '<table name>'
Got it: the key is to separate "hot" content from "cold" content into different column families. Column families store their columns in separate HFiles, so we can use one column family for the indexed (or frequently read) columns, and another column family (thus another file) for all the other columns.
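For illustration, here is a sketch of such a layout using the plain HBase admin API (the family names f1 and f2 match the bulk-load steps below; the 0.98-era HBaseAdmin is assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateHotColdTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
        desc.addFamily(new HColumnDescriptor("f1"));   // hot : indexed / frequently read columns
        desc.addFamily(new HColumnDescriptor("f2"));   // cold : all the other columns
        admin.createTable(desc);
        admin.close();
    }
}

A scan that calls scan.addFamily(Bytes.toBytes("f1")) then only reads the hot family's HFiles.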
First step: check that a smaller column family is faster to scan
We simply discard cold content to build a single smaller column family (1655 columns -> 7 columns).
Performance of scans on a medium-size table:
Remarks:
Second step: generate a multi-family HTable
We do a bulk load by generating HFiles from Hive. Although the documentation says we can't generate a multi-family table in one pass, we can generate the HFiles for each family separately:
create table mytable_f1 (UUID string, source_col1 string, source_col2 string)
...
TBLPROPERTIES('hfile.family.path' = '/tmp/mytable/f1');
create table mytable_f2 (UUID string, source_col3 string, source_col4 string)
...
TBLPROPERTIES('hfile.family.path' = '/tmp/mytable/f2');
And then simply call the import command as usual:
hadoop jar [hbase-server-jar] completebulkload /tmp/mytable mytable
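The same load can also be triggered from Java; a sketch, assuming the 0.98-era LoadIncrementalHFiles API (/tmp/mytable must contain one subdirectory per family, here f1 and f2):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");
        // Moves the generated HFiles into the regions of "mytable"
        new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/mytable"), table);
        table.close();
    }
}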
Turn off the block cache at the time of the scan (it churns your heap memory): scan.setCacheBlocks(false);
Figure out the size of your records; if it is > 1 MB, increase hbase.client.scanner.timeout.period.
Set scan.setCaching(x) so that x * record size (what is fetched in one shot) stays close to 1 MB.
A necessary check: make sure the regions of the table being scanned are evenly distributed across the region servers.
(If you have done a bulk load, run a major compaction once.)
These settings are put together in the sketch below.
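A minimal sketch (table name, caching value and timeout are illustrative and depend on your record size):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;

public class TunedScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Raise the scanner timeout if single records are large (> 1 MB)
        conf.setInt("hbase.client.scanner.timeout.period", 120000);
        HTable table = new HTable(conf, "mytable");

        Scan scan = new Scan();
        scan.setCacheBlocks(false);   // do not churn the block cache on a full scan
        scan.setCaching(100);         // aim for caching * record size close to 1 MB per RPC
        ResultScanner rs = table.getScanner(scan);
        for (Result r : rs) { /* process row */ }
        rs.close();
        table.close();
    }
}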