
How to improve HBase Scanner?

Tags:

hbase

How do I configure HBase so that the scanner retrieves only a limited number of records at a time? Or how can I improve scanner performance when the database contains a lot of records?

WackoMax asked May 04 '10 07:05




3 Answers

I believe the scanner actually requests only one row at a time unless you set the caching. You can check with Scan#getCaching() just to be sure.

Each time you call ResultScanner#next(), it will retrieve the next item. You can also use ResultScanner#next(int) to retrieve a number of results at a time.

When setting up the scanner, you can use Scan#setCaching to have results fetched in advance: http://hadoop.apache.org/hbase/docs/r0.20.4/api/org/apache/hadoop/hbase/client/Scan.html#setCaching(int)

Chances are your scanner is slow because you are reading only one record at a time, and every record then pays for a full RPC round trip. So if you intend to read a lot, let the client cache a batch of results for you in advance.
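
To put rough numbers on the round-trip argument, here is a minimal sketch in plain Java. It does not call HBase at all: the class name ScanCachingSketch and the method roundTrips are made up for illustration, and the model assumes each fetch returns up to `caching` rows while ignoring the calls that open and close the scanner.

```java
// Illustrative model only; this is not HBase client code.
public class ScanCachingSketch {

    /**
     * Number of fetches needed to stream `rows` rows when each fetch
     * returns up to `caching` rows (ceiling division).
     */
    public static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical table size
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.printf("caching=%d -> %d fetches%n",
                    caching, roundTrips(rows, caching));
        }
    }
}
```

The trade-off is memory: a larger caching value means fewer round trips, but each fetch holds more rows on both the client and the server.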

juhanic answered Nov 10 '22 00:11


You may also want to examine the Filter API, which allows you to selectively return a subset of rows or cells to the client: http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/filter/package-summary.html.

Jeff Hammerbacher answered Nov 10 '22 00:11


You can use scan.setMaxResultSize to cap how much data (the limit is given in bytes) is retrieved from HBase on each fetch. This does not mean the query returns fewer results overall; it only changes how many arrive per fetch.

If you want to limit the total result, like SQL's select top 100 from TABLE, you need to use a PageFilter.
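
The difference between the two can be sketched in plain Java. This is a toy model, not HBase code: the class ChunkVersusLimitSketch and its methods are hypothetical names, and string length stands in for the serialized size of a row. Chunking by a byte budget still delivers every row, while a row limit truncates the result.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; this is not HBase client code.
public class ChunkVersusLimitSketch {

    /**
     * Groups rows into fetches that stay within maxBytes where possible.
     * A single oversized row still gets its own fetch. Every row is delivered.
     */
    public static List<List<String>> fetchInChunks(List<String> rows, long maxBytes) {
        List<List<String>> fetches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long used = 0;
        for (String row : rows) {
            long size = row.length(); // stand-in for the row's size in bytes
            if (!current.isEmpty() && used + size > maxBytes) {
                fetches.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(row);
            used += size;
        }
        if (!current.isEmpty()) {
            fetches.add(current);
        }
        return fetches;
    }

    /** Keeps only the first `limit` rows, like a "top N" query. */
    public static List<String> firstN(List<String> rows, int limit) {
        int n = Math.max(0, Math.min(limit, rows.size()));
        return new ArrayList<>(rows.subList(0, n));
    }

    public static void main(String[] args) {
        List<String> rows = List.of("aaaa", "bbbb", "cccc", "dddd", "eeee");
        System.out.println("fetches: " + fetchInChunks(rows, 8));
        System.out.println("limited: " + firstN(rows, 2));
    }
}
```

One caveat about the real filter: PageFilter is evaluated separately on each region server, so a scan that spans several regions can return more rows than the page size. Stop reading on the client once you have enough.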

leehoawki answered Nov 10 '22 01:11