I'm running MapReduce (MR) over HBase.
The business logic in the reducer heavily accesses two tables, say T1 (40k rows) and T2 (90k rows). Currently, I'm executing the following steps:
1. In the constructor of the reducer class, I do something like this:
HBaseCRUD hbaseCRUD = new HBaseCRUD();
HTableInterface t1 = hbaseCRUD.getTable("T1", "CF1", null, "C1", "C2");
HTableInterface t2 = hbaseCRUD.getTable("T2", "CF1", null, "C1", "C2");
2. In reduce(...):
String lowercase = ....;
/* Start : HBase code */
/* TRY using get(...) on the table rather than a Scan! */
Scan scan = new Scan();
scan.setStartRow(lowercase.getBytes());
scan.setStopRow(lowercase.getBytes());
/* scan will return a single row */
ResultScanner resultScanner = t1.getScanner(scan);
for (Result result : resultScanner) {
    /* business logic */
}
Though I'm not sure whether the above code is sensible in the first place, I have a question: would a get(...) provide any performance benefit over the scan?
Get get = new Get(lowercase.getBytes());
Result getResult = t1.get(get);
Since T1 and T2 will be (mostly) read-only, I think performance will improve if they are kept in memory. As per the HBase documentation, I will have to re-create the tables T1 and T2. Please verify the correctness of my understanding:
public void createTables(String tableName, boolean readOnly,
        boolean blockCacheEnabled, boolean inMemory,
        String... columnFamilyNames) throws IOException {
    // TODO Auto-generated method stub
    HTableDescriptor tableDesc = new HTableDescriptor(tableName);
    /* not sure !!! */
    tableDesc.setReadOnly(readOnly);
    HColumnDescriptor columnFamily = null;
    if (!(columnFamilyNames == null || columnFamilyNames.length == 0)) {
        for (String columnFamilyName : columnFamilyNames) {
            columnFamily = new HColumnDescriptor(columnFamilyName);
            /*
             * Start : Do these steps ensure that the column
             * family(actually, the column data) is in-memory???
             */
            columnFamily.setBlockCacheEnabled(blockCacheEnabled);
            columnFamily.setInMemory(inMemory);
            /*
             * End : Do these steps ensure that the column family(actually,
             * the column data) is in-memory???
             */
            tableDesc.addFamily(columnFamily);
        }
    }
    hbaseAdmin.createTable(tableDesc);
    hbaseAdmin.close();
}
Once done :
When you compare a partial-key scan with a get, remember that the row key you pass to a Get can be a much longer string than the partial key you use for the scan. For the Get, HBase does a deterministic lookup: it locates the exact row key it needs to match and fetches that row.
would a get(...) provide any performance benefit over the scan?
Get operates directly on a particular row, identified by the row key passed as a parameter to the Get instance. A Scan, by contrast, operates on all the rows unless you restrict it to a range by providing start and stop row keys to your Scan instance. Clearly, it is more efficient when you know beforehand which row to operate on: you can go directly there and perform the desired operation.
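To make the difference concrete, here is a minimal plain-Java sketch of a store sorted by row key. It is only a model built on TreeMap, not the HBase client API; the class name RowStoreModel and the sample keys and values are made up for illustration. It shows a point lookup on a known key, a bounded scan (start row inclusive, stop row exclusive), and an unbounded scan that visits every row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Plain-Java model of a store sorted by row key; this is NOT the HBase client API.
final class RowStoreModel {
    // Rows are kept sorted by key, in the same spirit as HBase sorting rows by row key.
    private final TreeMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Point lookup on one known row key (the Get case).
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    // Bounded scan: start row inclusive, stop row exclusive (Scan with a range).
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).values());
    }

    // Unbounded scan: every row is visited (Scan without a range).
    List<String> scanAll() {
        return new ArrayList<>(rows.values());
    }

    public static void main(String[] args) {
        RowStoreModel t1 = new RowStoreModel();
        t1.put("apple", "v1");
        t1.put("apricot", "v2");
        t1.put("banana", "v3");

        System.out.println(t1.get("apple"));      // one row, found by its exact key
        System.out.println(t1.scan("ap", "aq"));  // rows whose keys start with "ap"
        System.out.println(t1.scanAll());         // all three rows
    }
}
```

In this model the point lookup touches exactly one entry, while the unbounded scan walks the whole map; the bounded scan sits in between and is the analogue of a partial-key scan.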
How to verify that the columns are in-memory (of course, the describe statement and the browser reflect it) and accessed from there and not the disk?
You can use the isInMemory() method provided by HColumnDescriptor to verify whether a particular CF is marked in-memory. However, you cannot find out whether the entire table is in memory, or whether a given fetch is served from disk or from memory. Although in-memory blocks have the highest priority, there is no guarantee that everything stays in memory all the time. One important point is that data is persisted to disk even for an in-memory CF.
Is the from-memory or from-disk read transparent to the client? In simple words, do I need to change the HTable access code in my reducer class? If yes, what are the changes?
Yes. It is totally transparent. You don't have to do anything extra.
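As a rough illustration of why the caller does not need to change anything, here is a toy read-through cache in plain Java. It is a model of the general idea only, not HBase's block cache: the class ReadThroughModel, its capacity parameter, and the diskReads counter are all hypothetical names introduced for this sketch. Every write goes to the "disk" map, reads are served from a bounded cache when possible, and get(...) has the same signature either way.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy read-through cache; a model of the idea only, NOT HBase's block cache.
final class ReadThroughModel {
    // Durable copy: every value is always stored here, cached or not.
    private final Map<String, String> disk = new HashMap<>();
    // Bounded cache that evicts the least recently used entry when full.
    private final LinkedHashMap<String, String> cache;
    // Exposed only so the demonstration can observe cache misses.
    int diskReads = 0;

    ReadThroughModel(final int cacheCapacity) {
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > cacheCapacity;
            }
        };
    }

    void put(String rowKey, String value) {
        disk.put(rowKey, value);
    }

    // Same call for the client whether the value comes from cache or from disk.
    String get(String rowKey) {
        String cached = cache.get(rowKey);
        if (cached != null) {
            return cached;
        }
        diskReads++;
        String value = disk.get(rowKey);
        if (value != null) {
            cache.put(rowKey, value);
        }
        return value;
    }

    public static void main(String[] args) {
        ReadThroughModel table = new ReadThroughModel(1);
        table.put("a", "1");
        table.put("b", "2");
        table.get("a"); // miss: read from disk, then cached
        table.get("a"); // hit: served from the cache
        table.get("b"); // miss: caching "b" evicts "a"
        table.get("a"); // miss again, yet the caller's code is unchanged
        System.out.println(table.diskReads);
    }
}
```

The point of the sketch is that eviction can push an entry back to disk at any time, and the only thing the client could notice is latency, never a different API.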