I need to write a MapReduce Job that Gets all rows in a given Date Range(say last one month). It would have been a cakewalk had My Row Key started with Date. But My frequent Hbase queries are on starting values of key.
My Row key is exactly A|B|C|20120121|D . Where combination of A/B/C along with date (in YearMonthDay format) makes a unique row ID.
My Hbase tables could have upto a few million rows. Should my Mapper read all the table and filter each row if it falls in given date range or Scan / Filter can help handling this situation?
Could someone suggest (or a snippet of code) a way to handle this situation in an effective manner?
Thanks -Panks
A RowFilter with a RegEx Filter would work, but would not be the most optimal solution. Alternatively you can try to use secondary indexes.
One more solution is to try the FuzzyRowFIlter. A FuzzyRowFilter uses a kind of fast-forwarding, hence skipping many rows in the overall scan process and will thus be faster than a RowFilter Scan. You can read more about it here.
Alternatively BloomFilters might also help depending on your schema. If your data is huge you should do a comparative analysis on secondary index and Bloom Filters.
You can use a RowFilter with a RegexStringComparator. You'd need to come up with a RegEx that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With