HBase Mapreduce on multiple scan objects

Question

I am evaluating HBase for some data analysis work we are doing.

HBase would contain our event data. The key would be eventId + time. We want to run analysis on a few event types (4-5) within a date range. The total number of event types is around 1000.

The problem with running a MapReduce job on the HBase table is that initTableMapperJob (see below) takes only one Scan object. For performance reasons, we want to scan the data for only the 4-5 event types in a given date range, not all 1000 event types. If we use the method below, I guess we don't have that choice, because it takes only one Scan object.

public static void initTableMapperJob(String table, Scan scan, Class mapper, Class outputKeyClass, Class outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException

Is it possible to run MapReduce on a list of Scan objects? Is there any workaround?

Thanks

Dave L. · Accepted Answer

TableMapReduceUtil.initTableMapperJob configures your job to use TableInputFormat which, as you note, takes a single Scan.

It sounds like you want to scan multiple segments of a table. To do so, you'll have to create your own InputFormat, something like MultiSegmentTableInputFormat. Extend TableInputFormatBase and override the getSplits method so that it calls super.getSplits once for each start/stop row segment of the table. (The easiest way would be to call setStartRow() and setStopRow() on the Scan held by TableInputFormatBase before each call.) Aggregate the InputSplit instances returned into a single list.

Then configure the job yourself to use your custom MultiSegmentTableInputFormat.
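To make the "start/stop row segment" idea concrete, here is a minimal sketch of how the list of segments could be computed, one per event type, for a date range. It uses only the JDK and does not call HBase. The key layout (a 4-byte big-endian eventId followed by an 8-byte big-endian epoch-millisecond timestamp) is an assumption, since the question only says the key is eventId + time; the names EventSegments, Segment, rowKey, and segmentsFor are hypothetical. Each resulting pair would be what the overridden getSplits applies to the scan before one call to super.getSplits.

```java
import java.nio.ByteBuffer;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one row-key range per event type.
class EventSegments {

    // One contiguous row range: start inclusive, stop exclusive,
    // matching the usual start/stop row semantics of a scan.
    static final class Segment {
        final byte[] startRow;
        final byte[] stopRow;

        Segment(byte[] startRow, byte[] stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    // ASSUMED key layout: 4-byte big-endian eventId, then 8-byte big-endian
    // epoch milliseconds. With non-negative values, byte-wise ordering of
    // these keys matches (eventId, time) ordering.
    static byte[] rowKey(int eventId, long epochMillis) {
        return ByteBuffer.allocate(12).putInt(eventId).putLong(epochMillis).array();
    }

    // Returns one segment per event id, covering [from, toExclusive) in UTC.
    static List<Segment> segmentsFor(int[] eventIds, LocalDate from, LocalDate toExclusive) {
        long startMs = from.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        long stopMs = toExclusive.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        List<Segment> segments = new ArrayList<>();
        for (int id : eventIds) {
            segments.add(new Segment(rowKey(id, startMs), rowKey(id, stopMs)));
        }
        return segments;
    }

    public static void main(String[] args) {
        // Example: five event types out of ~1000, for January 2020.
        int[] wanted = {3, 17, 42, 256, 901};
        List<Segment> segments =
                segmentsFor(wanted, LocalDate.of(2020, 1, 1), LocalDate.of(2020, 2, 1));
        for (Segment s : segments) {
            int eventId = ByteBuffer.wrap(s.startRow).getInt();
            System.out.println("eventId " + eventId + ": "
                    + s.startRow.length + "-byte start key, "
                    + s.stopRow.length + "-byte stop key");
        }
    }
}
```

With this layout, each job scans only as many narrow ranges as there are selected event types, rather than the full table. If the real key encodes eventId or time differently (for example as strings), only rowKey would need to change.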

Tags:

mapreduce

hbase