
HBase FuzzyRowFilter: how does jumping over keys work?

I know that FuzzyRowFilter takes two parameters, the first being a row key and the second being the fuzzy logic. What I understood from the corresponding Java class, FuzzyRowFilter, is that the filter evaluates the current row, tries to compute the next higher row key that will match the fuzzy logic, and jumps over the non-matching keys.

I am unable to understand the following things:

How does the scan jump over certain row keys? Does it use a Get to fetch and compare the current row key? How does the scan know where the next matching row key is without doing a full scan (if it jumps)?

Vikram Singh Chandel asked Feb 03 '14 12:02



1 Answer

You understood everything correctly.

For those who came here from a web search, here are two links that explain how row skipping can be leveraged in general and how it is done in FuzzyRowFilter in particular:

  1. HBase FuzzyRowFilter: Alternative to Secondary Indexes
  2. Filters in HBase (or intra row scanning part II)
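To make the "row key plus fuzzy logic" pair from the question concrete, here is a minimal, self-contained sketch of the matching rule. It assumes the convention documented for FuzzyRowFilter's public API, where a mask byte of 0 marks a fixed position and 1 marks a position that may hold any byte. The class `FuzzyMatchSketch`, its methods, and the key layout `????_02` are hypothetical names for illustration; this is not the HBase implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of HBase.
final class FuzzyMatchSketch {

    // Returns true when the row agrees with the pattern at every fixed position.
    // mask[i] == 0: byte i is fixed and must equal pattern[i].
    // mask[i] == 1: byte i is a wildcard and is not compared.
    static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (pattern.length != mask.length) {
            throw new IllegalArgumentException("pattern and mask must have equal length");
        }
        // Simplifying assumption of this sketch: rows shorter than the pattern never match.
        if (row.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Small helper so the example can use readable ASCII keys.
    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }
}
```

With the pattern `????_02` and the mask `{1, 1, 1, 1, 0, 0, 0}`, the key `0001_02` matches and `0001_03` does not, because only the last three bytes are compared.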

If a filter knows it's at the last key and needs to skip ahead:

  1. The filter returns SEEK_NEXT_USING_HINT
  2. The Region Server calls getNextCellHint, which returns a suggested Cell
  3. The Region Server performs exactly the same routine for finding a key as it did for the first key in the scan: it examines the available HFiles, checking whether the key in question is there
    1. The Region Server reads the "trailer" section of each file to get the offsets of the metadata blocks
    2. The Region Server reads the Meta and FileInfo metadata block types to avoid reading the binary data from the HFile if there's no chance that the key is present (Bloom Filter), if the file is too old (Max SequenceId) or if the file is too new (Timerange) to contain what we're looking for. See more about HFile format here
    3. Should the key be inside the HFile, the Region Server uses the data block index segments to compute the offset of the data block that has the key in question
    4. If the data block with the key happens to already be in the Region Server block cache, the next step is skipped
    5. The data block is read from the HFile
    6. The Region Server finally scans keys one by one until it hits the target one
  4. The found key, and potentially the whole row (depending on the filter), is passed to the filter code
  5. The whole cycle repeats
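The cycle above can be sketched as a small simulation. This is not HBase code: a `TreeSet` stands in for the sorted, indexed storage, `ceiling` stands in for the Region Server's seek, and the hint computation is a deliberately simplified stand-in for getNextCellHint. It assumes hypothetical row keys of the form `<4-digit user>_<2-digit action>` and a filter that wants only action `02`. All class, method, and field names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical simulation of "filter asks for a seek, scanner jumps to the hint".
final class SeekHintSketch {

    enum ReturnCode { INCLUDE, SEEK_NEXT_USING_HINT }

    // Fixed part of the assumed pattern "????_02".
    static final String ACTION = "02";

    // Number of rows the filter was asked about during the last scan.
    static int examined = 0;

    static ReturnCode filter(String rowKey) {
        examined++;
        return rowKey.endsWith("_" + ACTION)
                ? ReturnCode.INCLUDE
                : ReturnCode.SEEK_NEXT_USING_HINT;
    }

    // Smallest key greater than a non-matching rowKey that could still match,
    // or null when no such key can exist in this key layout.
    static String nextHint(String rowKey) {
        String user = rowKey.substring(0, 4);
        String action = rowKey.substring(5);
        if (action.compareTo(ACTION) < 0) {
            return user + "_" + ACTION;      // same user, jump forward to action 02
        }
        int nextUser = Integer.parseInt(user) + 1;
        if (nextUser > 9999) {
            return null;                     // no later user fits in four digits
        }
        return String.format("%04d_%s", nextUser, ACTION);
    }

    static List<String> scan(TreeSet<String> rows) {
        examined = 0;
        List<String> result = new ArrayList<>();
        String cursor = rows.isEmpty() ? null : rows.first();
        while (cursor != null) {
            if (filter(cursor) == ReturnCode.INCLUDE) {
                result.add(cursor);
                cursor = rows.higher(cursor);   // plain "next row"
            } else {
                String hint = nextHint(cursor);
                // Seek: first stored key >= hint, found through the sorted structure
                // without visiting the keys in between.
                cursor = (hint == null) ? null : rows.ceiling(hint);
            }
        }
        return result;
    }
}
```

For the nine rows `0001_01` through `0001_05`, `0002_01`, `0002_02`, `0003_07` and `0005_02`, `scan` returns `0001_02`, `0002_02` and `0005_02` while the filter is consulted for only six rows. The skipped rows (`0001_04`, `0001_05`, `0002_01`) are never handed to the filter, which is the effect the seek hint is meant to achieve. No Get is involved; the jump is just another seek of the same kind the scan performs to find its first row.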
Igor Katkov answered Nov 10 '22 21:11
