 

HBase Scan with Multiple Ranges

I have an HBase table, and I need to read the results from several row ranges, for example rows 1-6, 100-150, and so on. I know that each scan takes a start row and a stop row, but with 6 ranges that means running 6 scans. Is there any way to get the results for multiple ranges from one scan, or from one RPC? My HBase version is 0.98.

Cheng Chen asked Oct 29 '15 20:10



1 Answer

Use MultiRowRangeFilter, a filter that supports scanning multiple row key ranges. It constructs the row key ranges from the list you pass in, and that list can be accessed by each region server.

HBase is quite efficient when scanning a single small row key range. If the user needs to specify multiple row key ranges in one scan, the typical solutions are:

  1. a FilterList, which is a list of row key Filters,
  2. a SQL layer over HBase, such as Hive or Phoenix, that joins two tables.

However, both solutions are inefficient. Neither can use the range information to fast-forward during the scan, so the scan is quite time consuming. If the number of ranges is very large (e.g. millions), a join is a proper solution even though it is slow. But there are cases where the user wants to scan only a small number of ranges (e.g. <1000 ranges), and neither solution provides satisfactory performance in that case.

MultiRowRangeFilter supports this use case (scanning multiple row key ranges): it constructs the row key ranges from a user-specified list and fast-forwards between them during the scan. Thus, the scan is quite efficient.
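To make the fast-forwarding idea concrete, here is a minimal plain-Java sketch that does not use HBase at all. It is not the filter's actual implementation: the class `RangeSeekSketch`, its `Range` type, and the `scan` and `demoRows` helpers are hypothetical names invented for illustration, and a sorted `TreeSet` of strings stands in for the rows of a region. It assumes the ranges are already sorted by start key and do not overlap. When the current key falls in a gap, the loop jumps straight to the first key at or after the next range's start instead of testing every row in between.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustrative only: simulates range-based seeking over sorted keys.
class RangeSeekSketch {

    /** Half-open range [start, stop), compared lexicographically. */
    static final class Range {
        final String start;
        final String stop;

        Range(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }
    }

    /**
     * Collects the keys that fall inside any range.
     * Assumes ranges are sorted by start key and do not overlap.
     * visited[0] is incremented once for every key that is examined.
     */
    static List<String> scan(NavigableSet<String> rows, List<Range> ranges, int[] visited) {
        List<String> hits = new ArrayList<>();
        String key = rows.isEmpty() ? null : rows.first();
        int i = 0;
        while (key != null && i < ranges.size()) {
            Range r = ranges.get(i);
            if (key.compareTo(r.stop) >= 0) {
                i++; // key is past this range, so try the next range
                continue;
            }
            visited[0]++;
            if (key.compareTo(r.start) < 0) {
                // Key lies in a gap: jump to the first key >= range start.
                key = rows.ceiling(r.start);
            } else {
                hits.add(key); // key is inside the current range
                key = rows.higher(key);
            }
        }
        return hits;
    }

    /** Builds n zero-padded keys: "000", "001", ... */
    static NavigableSet<String> demoRows(int n) {
        NavigableSet<String> rows = new TreeSet<>();
        for (int k = 0; k < n; k++) {
            rows.add(String.format("%03d", k));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Range> ranges = new ArrayList<>();
        ranges.add(new Range("001", "002"));
        ranges.add(new Range("003", "004"));
        ranges.add(new Range("050", "052"));
        int[] visited = new int[1];
        List<String> hits = scan(demoRows(100), ranges, visited);
        System.out.println("hits=" + hits + " examined=" + visited[0]);
    }
}
```

With 100 rows and these three ranges, the sketch returns four rows after examining seven keys, whereas a filter that tests each row individually would have to look at every row up to the end of the last range. In HBase itself the jump is carried out by the region server based on a seek hint from the filter, not by client code like this.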

package chengchen;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowKeyRange;
import org.apache.hadoop.hbase.util.Bytes;



public class MultiRowRangeFilterTest {
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            throw new Exception("Table name not specified.");
        }
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, args[0]);

        // A single scan covers all ranges; the filter skips the gaps between them.
        Scan scan = new Scan();
        List<RowKeyRange> ranges = new ArrayList<RowKeyRange>();
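        // RowKeyRange(startRow, startInclusive, stopRow, stopInclusive):
        // start row inclusive, stop row exclusive, like a regular Scan.
        ranges.add(new RowKeyRange(Bytes.toBytes("001"), true, Bytes.toBytes("002"), false));
        ranges.add(new RowKeyRange(Bytes.toBytes("003"), true, Bytes.toBytes("004"), false));
        ranges.add(new RowKeyRange(Bytes.toBytes("005"), true, Bytes.toBytes("006"), false));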
        Filter filter = new MultiRowRangeFilter(ranges);
        scan.setFilter(filter);
        int count = 0;
        ResultScanner scanner = table.getScanner(scan);
        Result r = scanner.next();
        while (r != null) {
            count++;
            r = scanner.next();
        }
        System.out.println("++ Scanning finished with count : " + count + " ++");
        // Release the scanner and the table connection.
        scanner.close();
        table.close();
    }

}

Please see this test case for an implementation in Java.

Note: for this kind of requirement, Solr or Elasticsearch (ES) is the best approach in my opinion. You can check my answer on Solr for a high-level architecture overview. I'm suggesting this because an HBase scan over huge data will be very slow.

Ram Ghadiyaram answered Sep 24 '22 21:09
