
MapReduce/Aggregate operations in Spring Batch

Is it possible to do MapReduce-style operations in Spring Batch?

I have two steps in my batch job. The first step calculates an average. The second step compares each value with that average to determine another value.

For example, let's say I have a huge database of student scores. The first step calculates the average score for each course/exam. The second step compares individual scores with the average to determine a grade based on a simple rule:

  1. A if the student scores above average
  2. B if the student's score equals the average
  3. C if the student scores below average
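As a rough sketch, the rule above could be written as a small pure function that a Processor would call for each row. The names `Grade` and `GradeRule` are hypothetical and do not come from the post; the sketch also assumes the average for the row's course is already known.

```java
// Hypothetical types for illustration; not part of Spring Batch.
enum Grade { A, B, C }

final class GradeRule {
    private GradeRule() {}

    // Compares one score with the (already computed) course average.
    static Grade grade(double score, double average) {
        int cmp = Double.compare(score, average);
        if (cmp > 0) {
            return Grade.A;   // above average
        }
        if (cmp == 0) {
            return Grade.B;   // exactly average
        }
        return Grade.C;       // below average
    }
}
```

Keeping the rule free of any I/O means it can be unit-tested on its own, whichever way the average ends up being computed.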

Currently my first step is a SQL query which selects the average and writes it to a table. The second step is a SQL query which joins the average scores with the individual scores and uses a Processor to implement the rule.

Similar aggregate functions such as avg and min are used a lot in my steps, and I'd really prefer to do this in Processors, keeping the SQL as simple as possible. Is there any way to write a Processor which aggregates results across multiple rows based on a grouping criterion and then writes the average/min to the output table once?

This pattern repeats a lot, and I'm not looking for a single-Processor implementation that uses one SQL query to fetch both the average and the individual scores.
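To make the kind of aggregation being asked about concrete, here is a minimal sketch in plain Java (no Spring Batch types) that groups rows by course and derives the average and minimum in one pass over an in-memory list. `ScoreRow` and `CourseStats` are hypothetical names, and the sketch assumes all rows for the groups fit in memory.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical value type: one row of the scores table.
final class ScoreRow {
    final String course;
    final double score;

    ScoreRow(String course, double score) {
        this.course = course;
        this.score = score;
    }
}

final class CourseStats {
    private CourseStats() {}

    // Groups rows by course; each group's statistics expose
    // getAverage(), getMin(), getMax(), getSum() and getCount().
    static Map<String, DoubleSummaryStatistics> byCourse(List<ScoreRow> rows) {
        return rows.stream().collect(
            Collectors.groupingBy(
                (ScoreRow r) -> r.course,
                Collectors.summarizingDouble((ScoreRow r) -> r.score)));
    }
}
```

For example, `CourseStats.byCourse(rows).get("math").getAverage()` would give the average for a course keyed as `"math"`, and `getMin()` the minimum, so one grouping covers both avg and min.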

Sathish asked May 25 '11 06:05


1 Answer

It is possible. You do not even need more than one step: Map-Reduce can be implemented in a single step. Create a step with an ItemReader and an ItemWriter associated with it, and think of the ItemReader/ItemWriter pair as Map/Reduce. You can achieve the necessary effect by using a custom reader and writer with proper line aggregation. It might be a good idea for your reader/writer to implement the ItemStream interface, so that Spring Batch saves their intermediate state in the step's ExecutionContext.
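One way to picture the "reduce" side is a writer that keeps running totals across chunks and emits each average once at the end of the step. The sketch below is plain Java and only mirrors the shape of a chunk-oriented writer; `AveragingWriter` and its `write`/`close` methods are hypothetical stand-ins, not the Spring Batch `ItemWriter` or `ItemStream` API. In a real job the `totals` map is the state you would want saved in the `ExecutionContext` so a restart can resume.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a custom writer; it does not depend on Spring Batch.
final class AveragingWriter {
    // Running {sum, count} per course, kept across chunks: the "reduce" state.
    private final Map<String, double[]> totals = new LinkedHashMap<>();

    // Called once per chunk; each item is a (course, score) pair
    // produced by the reader, which plays the "map" role.
    void write(List<Map.Entry<String, Double>> chunk) {
        for (Map.Entry<String, Double> item : chunk) {
            double[] t = totals.computeIfAbsent(item.getKey(), k -> new double[2]);
            t[0] += item.getValue(); // sum
            t[1] += 1;               // count
        }
    }

    // Called once after the last chunk; returns one average per course.
    Map<String, Double> close() {
        Map<String, Double> averages = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : totals.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }
}
```

Note that this yields only the averages. Grading each score against its course average still needs either a second pass over the data or buffering of the individual rows, because an average is not final until every row of the group has been seen.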

I tried it just for fun, but I think it is pointless, since your working capacity is limited by a single JVM. In other words, you could not reach the production performance of a Hadoop cluster (or other real map-reduce implementations). It will also be really hard to scale as your data size grows.

Nice observation but IMO currently useless for real world tasks.

aviad answered Oct 23 '22 04:10