Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tuning Hive Queries That Uses Underlying HBase Table

I've got a table in Hbase let's say "tbl" and I would like to query it using Hive. Therefore I mapped a table to hive as follows:

CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
TBLPROPERTIES("hbase.table.name" = "tbl");

Queries like:

select * from tbl", "select id from tbl", "select id, data
from tbl

are really fast.

But queries like

select id from tbl where substr(id, 0, 5) = "12345"

select id from tbl where data["777"] IS NOT NULL 

are incredibly slow.

In the contrary when running from Hbase shell:

"scan 'tbl', {
COLUMNS=>'data', STARTROW='12345', ENDROW='12346'}" or
"scan 'tbl', { COLUMNS=>'data', "FILTER" =>
FilterList.new([qualifierFilter('777')])}"

it is lightning fast!

When I looked into the mapred job generated by hive on jobtracker I discovered that "map.input.records" counts ALL the items in Hbase table, meaning the job makes a full table scan before it even starts any mappers!! Moreover, I suspect it copies all the data from Hbase table to hdfs to mapper tmp input folder before executuion.

So, my questions are - Why hbase storage handler for hive does not translate hive queries into appropriate hbase functions? Why it scans all the records and then slices them using "where" clause? How can it be improved?

Any suggestions to improve the performance of Hive queries(mapped to HBase Table).

Can we create secondary index on HBase tables?

We are using HBase and Hive integration and trying to tune the performance of Hive queries.

like image 325
Ram Avatar asked May 06 '15 10:05

Ram


1 Answers

Lots of questions!, I'll try to answer all and give you a few performance tips:

The data is not copied to the HDFS, but the mapreduce jobs generated by HIVE will store their intermediate data in the HDFS.

Secondary indexes or alternative query paths are not supported by HBase (more info).

Hive will translate everything into MapReduce jobs which need time to be distributed & initialized, if you have a very small number of rows its possible that a simple SCAN operation in the Hbase shell is faster than a Hive query but on big datasets, distributing the job among the datanodes is a must.

The Hive HBase handler doesn't do a very good job when extracting the start & stop row keys from the query, queries like substr(id, 0, 5) = "12345" won't use start & stop row keys.

Before executing your queries, run a EXPLAIN [your_query]; command and check for the filterExpr: part, if you don't find it, your query will perform a full table scan. On a side note, all expresions within the Filter Operator: will be transformed into the appropiate filters.

EXPLAIN SELECT * FROM tbl WHERE (id>='12345') AND (id<'12346')
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        tbl 
          TableScan
            alias: tbl 
            filterExpr:
                expr: ((id>= '12345') and (id < '12346'))
                type: boolean
            Filter Operator
                ....

Fortunately, there is an easy way to make sure start & stop row keys are used when you're looking for row-key prefixes, just convert substr(id, 0, 5) = "12345" to a simpler query: id>="12345" AND id<"12346", it will be detected by the handler and start & stop row keys will be provided to the SCAN (12345, 12346)


Now, here are a few tips in order to speed up your queries (by a lot):

  • Make sure you set the following properties to take advantage of batching to reduce the number of RPC calls (the number depends on the size of your columns)

    SET hbase.scan.cache=10000;

    SET hbase.client.scanner.cache=10000;

  • Make sure you set the following properties to run a distributed job in your task trackers instead of running local job.

    SET mapred.job.tracker=[YOUR_JOB_TRACKER]:8021;

    SET hbase.zookeeper.quorum=[ZOOKEEPER_NODE_1],[ZOOKEEPER_NODE_2],[ZOOKEEPER_NODE_3];

  • Reduce the amount of columns of your SELECT statement to the minimum. Try not to SELECT *

  • Whenever you want to use start & stop row keys to prevent full table scans, always provide key>=x and key<y expressions (don't use the BETWEEN operator)

  • Always EXPLAIN SELECT your queries before executing them.

like image 81
Rubén Moraleda Avatar answered Nov 16 '22 02:11

Rubén Moraleda