
Performance difference between Scan and Get?

Tags: hbase

I have an HBase table containing 8 GB of data.

When I use a partial key scan on that table to retrieve the value for a given key, retrieval time is almost constant.

When I use a Get, the time taken is far greater than with the scan. However, when I looked at the code, I found that Get itself uses a Scan.

Can anyone explain this time difference?

Aniket Dutta asked Jan 27 '13 08:01



2 Answers

Correct: when you issue a Get, a scan happens behind the scenes. Cloudera's blog post confirms this: "Each time a get or a scan is issued, HBase scan (sic) through each file to find the result."

I can't confirm your results, but I think the clue may lie in your "partial key scan". When you compare a partial key scan and a get, remember that the row key you use for Get can be a much longer string than the partial key you use for the scan.

In that case, for the Get, HBase has to do a deterministic lookup to find the exact location of the row key it needs to match, and then fetch it. With the partial key, HBase does not need to look up an exact key match; it only needs to find the approximate location where keys with that prefix begin.
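To make the distinction above concrete, here is a minimal sketch that models a sorted set of row keys with a plain java.util.TreeMap. It is a toy stand-in, not HBase code: the class name SortedKeyLookup, its methods, and the sample row keys are all hypothetical, and it says nothing about real HBase timings. It only shows the two access patterns: an exact match on the full key versus a seek to the first key at or after a prefix, followed by a forward read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SortedKeyLookup {
    // Toy stand-in for a sorted store of row keys (an assumption, not HBase internals).
    private final NavigableMap<String, String> rows = new TreeMap<>();

    public void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Exact lookup: returns a value only if the full row key matches.
    public String get(String rowKey) {
        return rows.get(rowKey);
    }

    // Prefix scan: seek to the first key >= prefix, then read forward
    // while the keys still start with that prefix.
    public List<String> scanPrefix(String prefix) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : rows.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // keys are sorted, so no later key can match
            }
            values.add(e.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        SortedKeyLookup store = new SortedKeyLookup();
        store.put("user42|2013-01-26", "a");
        store.put("user42|2013-01-27", "b");
        store.put("user43|2013-01-27", "c");

        System.out.println(store.get("user42|2013-01-27")); // prints b
        System.out.println(store.scanPrefix("user42|"));    // prints [a, b]
    }
}
```

In this model both operations start with a seek in the same sorted structure; they differ in how long the key is and in how many rows are read afterwards.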

The answer to this is: it depends. I think it will depend on:

  1. Your row key "schema" or composition
  2. The length of the Get key and the Scan prefix
  3. How many regions you have

and possibly other factors.

Suman answered Oct 20 '22 05:10



On the backend, in HRegion, Scan and Get amount to nearly the same thing: both end up being executed by HRegion.RegionScannerImpl. Note below that get() instantiates a RegionScanner, much as invoking a Scan does.

org.apache.hadoop.hbase.regionserver.HRegion.RegionScannerImpl

public List<Cell> get(Get get, boolean withCoprocessor)
    throws IOException {

  List<Cell> results = new ArrayList<Cell>();

  // pre-get CP hook
  if (withCoprocessor && (coprocessorHost != null)) {
    if (coprocessorHost.preGet(get, results)) {
      return results;
    }
  }

  Scan scan = new Scan(get);

In the case of a get(), only a single row is returned, by invoking scanner.next() exactly once:

RegionScanner scanner = null;
try {
  scanner = getScanner(scan);
  scanner.next(results);
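
The pattern in the excerpts above can be sketched in a self-contained way. The following is a toy model only: GetAsScan, RowScanner, and their methods are hypothetical names, the "region" is a java.util.TreeMap, and it assumes that the scan built from a Get is bounded to that single row. It is not the HBase implementation; it just shows a get expressed as a scan on which next() is called once.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class GetAsScan {
    /** Hypothetical scanner over a sorted row range; a stand-in for a region scanner. */
    public static final class RowScanner {
        private final Iterator<Map.Entry<String, String>> it;

        RowScanner(NavigableMap<String, String> rows, String startRow, String stopRow) {
            // Both bounds inclusive, so startRow == stopRow selects at most one row.
            this.it = rows.subMap(startRow, true, stopRow, true).entrySet().iterator();
        }

        /**
         * Appends the next row's value to results.
         * In this toy model the return value says whether a row was read.
         */
        public boolean next(List<String> results) {
            if (!it.hasNext()) {
                return false;
            }
            results.add(it.next().getValue());
            return true;
        }
    }

    private final NavigableMap<String, String> rows = new TreeMap<>();

    public void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    /** A general scan over the inclusive range [startRow, stopRow]. */
    public RowScanner getScanner(String startRow, String stopRow) {
        return new RowScanner(rows, startRow, stopRow);
    }

    /** A get is modeled as a scan bounded to one row, with next() called once. */
    public List<String> get(String rowKey) {
        List<String> results = new ArrayList<>();
        RowScanner scanner = getScanner(rowKey, rowKey);
        scanner.next(results);
        return results;
    }

    public static void main(String[] args) {
        GetAsScan region = new GetAsScan();
        region.put("row1", "v1");
        region.put("row2", "v2");
        region.put("row3", "v3");
        System.out.println(region.get("row2"));    // prints [v2]
        System.out.println(region.get("missing")); // prints []
    }
}
```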
WestCoastProjects answered Oct 20 '22 05:10
