Currently I have two HBase tables (let's call them tableA and tableB). Using a single-stage MapReduce job, the data in tableA is read, processed, and saved to tableB. Currently both tables reside on the same HBase cluster. However, I need to relocate tableB to its own cluster.
Is it possible to configure a single-stage MapReduce job in Hadoop to read from and write to separate instances of HBase?
In MapReduce, extend TableMapper to scan one table and open an HBase Table object for the second table inside the mapper; that way you can join the two tables.
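A rough sketch of that idea, assuming the HBase 1.0+ client API; the table name "tableB" and the join logic are placeholders:
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;

public static class JoinMapper extends TableMapper<Text, Text> {

    private Connection connection;
    private Table secondTable;

    @Override
    protected void setup(Context context) throws IOException {
        // Open a second table alongside the one the job scans
        connection = ConnectionFactory.createConnection(context.getConfiguration());
        secondTable = connection.getTable(TableName.valueOf("tableB"));
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // Look up the matching row in the second table to join the two
        Result other = secondTable.get(new Get(row.get()));
        // ... combine value and other, then context.write(...)
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (secondTable != null) secondTable.close();
        if (connection != null) connection.close();
    }
}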
Basically, to perform CRUD operations on HBase tables we use the Java client API for HBase. Since HBase is written in Java and has a native Java API, it offers programmatic access to DML (Data Manipulation Language).
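For instance, a minimal CRUD sketch with the HBase 1.0+ client API; the table, column family, and qualifier names are placeholders:
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class CrudExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("myTable"))) {

            // Create / Update
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            table.put(put);

            // Read
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));

            // Delete
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}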
hbase-env.sh provides a handy mechanism for this kind of cluster configuration. HBase uses the Secure Shell (ssh) command and utilities extensively to communicate between cluster nodes. Each server in the cluster must be running ssh so that the Hadoop and HBase daemons can be managed.
It is possible: HBase's CopyTable MapReduce job does it by using TableMapReduceUtil.initTableReducerJob(), which allows you to set an alternative quorumAddress in case you need to write to remote clusters:
public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl)
quorumAddress - Distant cluster to write to; default is null for output to the cluster that is designated in hbase-site.xml. Set this String to the ZooKeeper ensemble of an alternate remote cluster when you would have the reduce write to a cluster other than the default; e.g. when copying tables between clusters, the source would be designated by hbase-site.xml and this param would have the ensemble address of the remote cluster. The format to pass is particular: pass <hbase.zookeeper.quorum>:<hbase.zookeeper.client.port>:<zookeeper.znode.parent>, such as server,server2,server3:2181:/hbase.
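For example, a minimal job driver sketch using that overload; the ZooKeeper hostnames zk1,zk2,zk3 and the CrossClusterCopy class are placeholders, and IdentityTableReducer simply passes the mapper's Puts through to the output table:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class CrossClusterCopy {

    // Turns each source row into a Put for the remote table (straight copy)
    public static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            Put put = new Put(row.get());
            for (Cell cell : value.rawCells()) {
                put.add(cell);
            }
            context.write(row, put);
        }
    }

    public static void main(String[] args) throws Exception {
        // Source cluster comes from the hbase-site.xml on the classpath
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "tableA -> remote tableB");
        job.setJarByClass(CrossClusterCopy.class);

        Scan scan = new Scan();
        scan.setCaching(500);       // bigger scanner caching is typical for MR jobs
        scan.setCacheBlocks(false); // don't churn the region server block cache

        // Read tableA from the local (source) cluster
        TableMapReduceUtil.initTableMapperJob("tableA", scan, CopyMapper.class,
                ImmutableBytesWritable.class, Put.class, job);

        // Write tableB on the remote cluster named by quorumAddress
        TableMapReduceUtil.initTableReducerJob("tableB", IdentityTableReducer.class,
                job, null, "zk1,zk2,zk3:2181:/hbase", null, null);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}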
Another option is to implement your own custom reducer to write to the remote table instead of writing to the context. Something similar to this:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class MyReducer extends Reducer<Text, Result, Text, Text> {

    protected Connection connection;
    protected BufferedMutator remoteTable;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        // Clone the job configuration and point it at the remote cluster's quorum
        Configuration config = HBaseConfiguration.create(context.getConfiguration());
        config.set("hbase.zookeeper.quorum", "quorum1,quorum2,quorum3");
        connection = ConnectionFactory.createConnection(config); // HBase 1.0+
        // (on HBase <1.0 you would use HConnectionManager.createConnection(config)
        // and an HTableInterface with setAutoFlush(false)/setWriteBufferSize)
        // BufferedMutator buffers writes client-side, like setAutoFlush(false) did
        BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("myTable"))
                .writeBufferSize(10L * 1024L * 1024L); // 10MB buffer
        remoteTable = connection.getBufferedMutator(params);
    }

    @Override
    public void reduce(Text boardKey, Iterable<Result> results, Context context)
            throws IOException, InterruptedException {
        /* Build Puts from the results and write them with remoteTable.mutate(put) */
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        super.cleanup(context);
        if (remoteTable != null) {
            remoteTable.flush(); // push any buffered writes to the remote cluster
            remoteTable.close();
        }
        if (connection != null) {
            connection.close();
        }
    }
}
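Note that with this approach the reducer never writes through the job context, so you would typically pair it with TableMapReduceUtil.initTableMapperJob() against the source table and set the job's output format to NullOutputFormat so Hadoop does not expect any output from the reducers.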