HBase MapReduce on multiple Scan objects

I am evaluating HBase for some data analysis we are doing.

HBase would contain our event data. The row key would be eventId + time. We want to run analysis on a few event types (4-5) within a date range. The total number of event types is around 1000.

The problem with running a MapReduce job on the HBase table is that initTableMapperJob (see below) takes only one Scan object. For performance reasons we want to scan the data for only those 4-5 event types in a given date range, not all 1000 event types. With the method below, a single Scan does not seem to let us express that.
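To make the access pattern concrete: with a key of eventId + time, each event type occupies its own contiguous key range for a given date window, so 4-5 event types translate into 4-5 separate start/stop row pairs. The following is a minimal sketch of computing those pairs. It assumes, purely for illustration, that event IDs are fixed-width strings and that the time part is a 13-digit zero-padded epoch-millisecond value. The class and method names (EventKeyRanges, rangesFor) are hypothetical and not part of HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EventKeyRanges {
    /** One contiguous row-key segment: start inclusive, stop exclusive. */
    static final class Range {
        final String startRow;
        final String stopRow;

        Range(String startRow, String stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    // Assumed key layout: eventId followed by a 13-digit zero-padded timestamp,
    // so that lexicographic order matches chronological order within one event type.
    static String rowKey(String eventId, long epochMillis) {
        return eventId + String.format("%013d", epochMillis);
    }

    // Builds one start/stop pair per event type for the window [fromMillis, toMillis).
    static List<Range> rangesFor(List<String> eventIds, long fromMillis, long toMillis) {
        List<Range> ranges = new ArrayList<>();
        for (String id : eventIds) {
            ranges.add(new Range(rowKey(id, fromMillis), rowKey(id, toMillis)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (Range r : rangesFor(Arrays.asList("login", "purchase"), 1000L, 2000L)) {
            System.out.println(r.startRow + " .. " + r.stopRow);
        }
    }
}
```

Each resulting pair would correspond to one scan segment, which is why a single start/stop row cannot cover the selected event types without also covering everything that sorts between them.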

public static void initTableMapperJob(String table, Scan scan, Class mapper, Class outputKeyClass, Class outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException

Is it possible to run MapReduce on a list of Scan objects? Is there any workaround?

Thanks

asked May 18 '26 17:05 by StackUnderflow


1 Answer

TableMapReduceUtil.initTableMapperJob configures your job to use TableInputFormat which, as you note, takes a single Scan.

It sounds like you want to scan multiple segments of a table. To do so, you'll have to create your own InputFormat, something like MultiSegmentTableInputFormat. Extend TableInputFormatBase and override the getSplits method so that it calls super.getSplits once for each start/stop row segment of the table. (The easiest way would be to update the start and stop row on the Scan held by TableInputFormatBase, e.g. via setStartRow(), before each call.) Aggregate the returned InputSplit instances into a single list.

Then configure the job yourself to use your custom MultiSegmentTableInputFormat.
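The core of that getSplits override is a simple loop: compute splits once per segment, then concatenate. The sketch below models only that pattern in plain Java, without any HBase dependency, so it is not a drop-in InputFormat. The names MultiSegmentSplits and Segment are hypothetical, and the splitsFor function is a stand-in for "set the start/stop row on the scan, then call super.getSplits".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class MultiSegmentSplits {
    /** Hypothetical stand-in for one start/stop row segment of the table. */
    static final class Segment {
        final String startRow;
        final String stopRow;

        Segment(String startRow, String stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    // Calls the per-segment split computation once for each segment and
    // aggregates everything into a single list, preserving segment order.
    static <S> List<S> getSplits(List<Segment> segments,
                                 Function<Segment, List<S>> splitsFor) {
        List<S> all = new ArrayList<>();
        for (Segment segment : segments) {
            all.addAll(splitsFor.apply(segment));
        }
        return all;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment("a1", "a2"),
                new Segment("b1", "b2"));
        // Toy split function: pretend each segment yields exactly one split.
        List<String> splits = getSplits(segments,
                s -> Arrays.asList(s.startRow + ".." + s.stopRow));
        System.out.println(splits);
    }
}
```

In a real subclass of TableInputFormatBase, the list element type would be InputSplit and the per-segment step would delegate to the inherited getSplits. Those details depend on the HBase version in use and are deliberately left out here.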

answered May 22 '26 08:05 by Dave L.


