I am trying to use HBase as a data source for Spark, so the first step is creating an RDD from an HBase table. Since Spark works with Hadoop input formats, I could find a way to read all rows into an RDD (http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase), but how do we create an RDD for a range scan?
All suggestions are welcome.
Here is an example of using Scan in Spark:
import java.io.{ByteArrayOutputStream, DataOutputStream}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Base64

// Serializes the Scan so it can be passed to TableInputFormat through the job configuration.
// On newer HBase versions where Scan is no longer Writable, use
// TableMapReduceUtil.convertScanToString(scan) instead (see the Java example below).
def convertScanToString(scan: Scan): String = {
  val out = new ByteArrayOutputStream
  val dos = new DataOutputStream(out)
  scan.write(dos)
  Base64.encodeBytes(out.toByteArray)
}
val conf = HBaseConfiguration.create()
val scan = new Scan()
scan.setCaching(500)
scan.setCacheBlocks(false)
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
rdd.count
You need to add the related libraries to the Spark classpath and make sure they are compatible with your Spark version. Tip: you can run the hbase classpath command to find them.
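The Scan above reads the whole table; to get a range scan, you can restrict the row-key range on the Scan before serializing it. A minimal sketch, reusing conf and convertScanToString from above, with placeholder row keys:

import org.apache.hadoop.hbase.util.Bytes

val rangeScan = new Scan()
rangeScan.setStartRow(Bytes.toBytes("startRowKey"))  // inclusive, placeholder key
rangeScan.setStopRow(Bytes.toBytes("stopRowKey"))    // exclusive, placeholder key
rangeScan.setCaching(500)
rangeScan.setCacheBlocks(false)
conf.set(TableInputFormat.SCAN, convertScanToString(rangeScan))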
You can set the conf as below:
val conf = HBaseConfiguration.create() // need to set all the params for HBase
conf.set(TableInputFormat.SCAN_ROW_START, "row2")
conf.set(TableInputFormat.SCAN_ROW_STOP, "stoprowkey")
This will load the RDD only for those records.
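Put together, a minimal sketch of this approach, assuming an existing SparkContext sc; the table name and row keys are placeholders, and the start row is inclusive while the stop row is exclusive:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "table_name")    // placeholder table name
conf.set(TableInputFormat.SCAN_ROW_START, "row2")       // inclusive start row key
conf.set(TableInputFormat.SCAN_ROW_STOP, "stoprowkey")  // exclusive stop row key

val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
rdd.count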
Here is a Java example with TableMapReduceUtil.convertScanToString(Scan scan):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;

public class HbaseScan {

    public static void main(String... args) throws IOException {
        // Spark conf
        SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("My App");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // HBase conf
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "big_table_name");

        // Create scan with a row-key range (start is inclusive, stop is exclusive)
        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);
        scan.setStartRow(Bytes.toBytes("a"));
        scan.setStopRow(Bytes.toBytes("d"));

        // Submit scan into hbase conf
        conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

        // Get RDD
        JavaPairRDD<ImmutableBytesWritable, Result> source = jsc
                .newAPIHadoopRDD(conf, TableInputFormat.class,
                        ImmutableBytesWritable.class, Result.class);

        // Process RDD
        System.out.println(source.count());
    }
}
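Whichever variant you use, the resulting RDD holds (ImmutableBytesWritable, Result) pairs. Here is a small Scala sketch of extracting the row key and one cell value, where the column family "cf" and qualifier "col1" are hypothetical names you would replace with your own:

import org.apache.hadoop.hbase.util.Bytes

// getValue returns null when the cell is absent, hence the Option wrapper
val rows = rdd.map { case (key, result) =>
  val rowKey = Bytes.toString(key.get())
  val value  = Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1")))
                 .map(Bytes.toString)
  (rowKey, value)
}
rows.take(10).foreach(println)

Mapping to plain Scala types before collecting also avoids serialization problems, since ImmutableBytesWritable and Result are not Java-serializable.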