How to Scan HBase Rows Efficiently

Question

I need to write a MapReduce job that gets all rows in a given date range (say, the last month). It would have been a cakewalk if my row key started with the date, but my most frequent HBase queries are on the leading values of the key.

My row key has exactly the form A|B|C|20120121|D, where the combination of A/B/C along with the date (in YearMonthDay format) makes a unique row ID.

My HBase tables could have up to a few million rows. Should my Mapper read the whole table and check whether each row falls in the given date range, or can a Scan / Filter help in this situation?

Could someone suggest a way (or a snippet of code) to handle this situation effectively?

Thanks -Panks

obh · Accepted Answer

A RowFilter with a RegexStringComparator would work, but it would not be the optimal solution, because the filter is still evaluated against every row in the table. Alternatively, you can try using secondary indexes.

Another option is the FuzzyRowFilter. A FuzzyRowFilter uses a kind of fast-forwarding, skipping many rows during the scan, and will therefore be faster than a RowFilter scan. You can read more about it here.
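To make the fuzzy-key idea concrete, here is a minimal sketch in plain Java of the (key, mask) pair such a filter works from. It assumes that A, B and C are each exactly one byte wide, since fuzzy matching operates on fixed byte positions; the class `FuzzyDateKey` and its helpers are hypothetical names, and the `matches` method only emulates the per-row comparison locally, not the server-side fast-forwarding.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: builds a fuzzy key and mask for keys shaped like A|B|C|yyyyMMdd|D.
// Assumption: A, B and C are one byte each, so every field sits at a fixed offset.
class FuzzyDateKey {

    // In the template, '?' marks a position whose byte may be anything.
    static byte[] keyFor(String template) {
        return template.getBytes(StandardCharsets.US_ASCII);
    }

    // Mask convention: 0 = byte must equal the template, 1 = byte is not fixed.
    static byte[] maskFor(String template) {
        byte[] mask = new byte[template.length()];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = (byte) (template.charAt(i) == '?' ? 1 : 0);
        }
        return mask;
    }

    // Simplified local emulation of the per-row check, for illustration only.
    static boolean matches(byte[] row, byte[] key, byte[] mask) {
        if (row.length < key.length) {
            return false;
        }
        for (int i = 0; i < key.length; i++) {
            if (mask[i] == 0 && row[i] != key[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String template = "?|?|?|201201??"; // any A, B, C; any day in January 2012
        byte[] key = keyFor(template);
        byte[] mask = maskFor(template);
        byte[] inRange = "A|B|C|20120121|D".getBytes(StandardCharsets.US_ASCII);
        byte[] outOfRange = "A|B|C|20111231|D".getBytes(StandardCharsets.US_ASCII);
        System.out.println(matches(inRange, key, mask));
        System.out.println(matches(outOfRange, key, mask));
        // With the HBase client on the classpath, the pair would be passed roughly as:
        //   scan.setFilter(new FuzzyRowFilter(Arrays.asList(new Pair<>(key, mask))));
    }
}
```

A date range that spans two calendar months would need one (key, mask) pair per month, with the Mapper discarding any days that fall outside the exact range. If A, B or C vary in width, they would have to be padded to a fixed width before this approach applies.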

Alternatively, Bloom filters might also help, depending on your schema. If your data is huge, you should do a comparative analysis of secondary indexes and Bloom filters.

Chris Shain · Answer

You can use a RowFilter with a RegexStringComparator. You'd need to come up with a regex that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.
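As a rough illustration of such a regex, the sketch below builds a pattern for one calendar month and checks it locally with `java.util.regex`. It assumes the key layout A|B|C|yyyyMMdd|D from the question and that A, B and C never contain a `|`; the class `RowKeyDateRegex` and the helper `monthRegex` are hypothetical names, and the HBase wiring is shown only in comments.

```java
import java.util.regex.Pattern;

// Sketch only: builds a row-key regex for one calendar month and tests it locally.
class RowKeyDateRegex {

    // Hypothetical helper: yearMonth is "yyyyMM", e.g. "201201".
    // Assumes keys look like A|B|C|yyyyMMdd|D and that A, B, C contain no '|'.
    static String monthRegex(String yearMonth) {
        if (!yearMonth.matches("\\d{6}")) {
            throw new IllegalArgumentException("expected yyyyMM, got " + yearMonth);
        }
        // Three '|'-terminated fields, then the month, any two-digit day, then the rest.
        return "^[^|]+\\|[^|]+\\|[^|]+\\|" + yearMonth + "\\d{2}\\|.*$";
    }

    public static void main(String[] args) {
        String regex = monthRegex("201201");
        Pattern p = Pattern.compile(regex);
        System.out.println(p.matcher("A|B|C|20120121|D").matches());
        System.out.println(p.matcher("A|B|C|20111231|D").matches());
        // With the HBase client on the classpath, the same string would be used roughly as:
        //   Scan scan = new Scan();
        //   scan.setFilter(new RowFilter(CompareOp.EQUAL,   // CompareOperator.EQUAL in HBase 2.x
        //                                new RegexStringComparator(regex)));
        // and the Scan then handed to TableMapReduceUtil.initTableMapperJob(...).
    }
}
```

A "last month" window usually spans two calendar months, so in practice the month part might become an alternation such as `(201201|201202)`, with the Mapper dropping the few days outside the exact range.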

Tags: mapreduce, hbase
