HBase : get(...) vs scan and in-memory table

Tags:

hadoop

mapreduce

hbase

I'm executing MR over HBase.

The business logic in the reducer heavily accesses two tables, say T1 (40k rows) and T2 (90k rows). Currently, I'm executing the following steps:

1. In the constructor of the reducer class, I do something like this:

HBaseCRUD hbaseCRUD = new HBaseCRUD();

HTableInterface t1= hbaseCRUD.getTable("T1",
                            "CF1", null, "C1", "C2");
HTableInterface t2= hbaseCRUD.getTable("T2",
                            "CF1", null, "C1", "C2");

In the reduce(...) method:

 String lowercase = ....;

/* Start : HBase code */
/*
 * TRY using get(...) on the table rather than a
 * Scan!
 */
Scan scan = new Scan();
scan.setStartRow(lowercase.getBytes());
scan.setStopRow(lowercase.getBytes());

/*scan will return a single row*/
ResultScanner resultScanner = t1.getScanner(scan);

for (Result result : resultScanner) {
 /*business logic*/
}

Though I'm not sure whether the above code is sensible in the first place, I have a question: would a get(...) provide any performance benefit over the scan?

Get get = new Get(lowercase.getBytes());
Result getResult = t1.get(get);

Since T1 and T2 will be (mostly) read-only, I think performance will improve if they are kept in memory. As per the HBase documentation, I will have to re-create the tables T1 and T2. Please verify the correctness of my understanding:

public void createTables(String tableName, boolean readOnly,
            boolean blockCacheEnabled, boolean inMemory,
            String... columnFamilyNames) throws IOException {
        // TODO Auto-generated method stub

        HTableDescriptor tableDesc = new HTableDescriptor(tableName);
        /* not sure !!! */
        tableDesc.setReadOnly(readOnly);

        HColumnDescriptor columnFamily = null;

        if (!(columnFamilyNames == null || columnFamilyNames.length == 0)) {

            for (String columnFamilyName : columnFamilyNames) {

                columnFamily = new HColumnDescriptor(columnFamilyName);
                /*
                 * Start : Do these steps ensure that the column
                 * family(actually, the column data) is in-memory???
                 */
                columnFamily.setBlockCacheEnabled(blockCacheEnabled);
                columnFamily.setInMemory(inMemory);
                /*
                 * End : Do these steps ensure that the column family(actually,
                 * the column data) is in-memory???
                 */

                tableDesc.addFamily(columnFamily);
            }
        }

        hbaseAdmin.createTable(tableDesc);
        hbaseAdmin.close();
    }

Once done:

1. How to verify that the columns are in-memory (of course, the describe statement and the browser reflect it) and accessed from there and not the disk?
2. Is the from-memory or from-disk read transparent to the client? In simple words, do I need to change the HTable access code in my reducer class? If yes, what are the changes?


asked Sep 11 '13 09:09

Kaliyug Antagonist

2 Answers

would a get(...) provide any performance benefit over the scan?

A Get operates directly on the single row identified by the row key passed to the Get instance. A Scan, by contrast, iterates over every row in the table unless you restrict it to a range by setting start and stop row keys on the Scan instance. If you know in advance which row you need, a Get is clearly more efficient: you go straight to that row and perform the desired operation.
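To make the difference concrete, here is a minimal sketch that uses a plain `TreeMap` as a stand-in for a table whose rows are sorted by row key. It is an analogy only, not HBase code: the class `SortedRowsSketch` and its methods are hypothetical names invented for this illustration. The one HBase behaviour it mirrors is that a bounded scan includes the start row and excludes the stop row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy stand-in for a table sorted by row key. Hypothetical class, not HBase code. */
final class SortedRowsSketch {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    /** Point lookup, analogous to a Get: one key in, at most one row out. */
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    /** Bounded read, analogous to a Scan with a start row (inclusive) and stop row (exclusive). */
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).values());
    }

    /** Unbounded read, analogous to a Scan with no start/stop rows: visits every row. */
    List<String> scanAll() {
        return new ArrayList<>(rows.values());
    }
}
```

With rows "a", "b" and "c" loaded, `get("b")` touches one entry, `scan("a", "c")` returns the values for "a" and "b", and `scanAll()` walks the whole map.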

How to verify that the columns are in-memory(of course, the describe statement and the browser reflect it) and accessed from there and not the disk?

You can use the isInMemory() method of HColumnDescriptor to verify whether a particular column family (CF) is flagged as in-memory. However, you cannot find out whether the entire table is actually held in memory, or whether a given fetch is served from memory or from disk. In-memory blocks get the highest priority in the block cache, but there is no guarantee that everything stays in memory all the time. Also note that the data is still persisted to disk even for an in-memory CF.
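The "highest priority, but no guarantee" point can be illustrated with a toy cache. This is a deliberately simplified sketch and not HBase's actual block cache algorithm: the class `PriorityCacheSketch`, its capacity counted in blocks, and the rule "evict normal blocks first, then the least recently used in-memory block" are all assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

/** Toy cache that evicts normal blocks before in-memory ones. Hypothetical, not HBase's cache. */
final class PriorityCacheSketch {
    private final int capacity; // maximum number of cached blocks (assumed unit)
    // Access-ordered maps iterate from least to most recently used.
    private final LinkedHashMap<String, Boolean> normal = new LinkedHashMap<>(16, 0.75f, true);
    private final LinkedHashMap<String, Boolean> inMemory = new LinkedHashMap<>(16, 0.75f, true);

    PriorityCacheSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Caches a block, then evicts until the cache fits its capacity again. */
    void cache(String blockId, boolean inMemoryFamily) {
        (inMemoryFamily ? inMemory : normal).put(blockId, Boolean.TRUE);
        while (normal.size() + inMemory.size() > capacity) {
            evictOne();
        }
    }

    boolean isCached(String blockId) {
        return normal.containsKey(blockId) || inMemory.containsKey(blockId);
    }

    /** Evicts the least recently used normal block, or an in-memory block if none is left. */
    private void evictOne() {
        LinkedHashMap<String, Boolean> pool = normal.isEmpty() ? inMemory : normal;
        Iterator<String> it = pool.keySet().iterator();
        it.next();
        it.remove();
    }
}
```

With a capacity of two blocks, caching one normal block and then two in-memory blocks evicts the normal one. Caching a third in-memory block then evicts the oldest in-memory block, so a read of that block would have to go back to disk even though its column family is flagged in-memory.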

Is the from-memory or from-disk read transparent to the client? In simple words, do I need to change the HTable access code in my reducer class? If yes, what are the changes?

Yes. It is totally transparent. You don't have to do anything extra.

answered Oct 21 '22 20:10

Tariq

There is no substantial difference between these as far as the implementation is concerned. They are both identical to the client.

answered Oct 21 '22 22:10

techvineet
