I am working on translating an existing time series database system to MapReduce model using Hadoop. The database system has both historical and real-time processing capabilities. So far, I was able to translate the batch processing functionality into Hadoop.
Unfortunately, when it comes to real-time processing, I see there are some conceptual inconsistencies with the MapReduce model.
I can write my own implementation of Hadoop's InputFormat interface that continuously feeds mappers with new data, so that the mappers can process it and emit results continuously. However, because reduce() is not called until all mappers have completed their execution, my computation is bound to get stuck at the mapping stage.
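For reference, here is a stripped-down sketch of the kind of RecordReader I have in mind (not a real implementation; the class name and the blocking-queue feed are just placeholders for whatever actually supplies the data):

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch: a RecordReader that blocks until the next record arrives, so the
// mapper never sees end-of-input and keeps emitting. How the queue is filled
// (socket, message broker, etc.) is deliberately left out.
public class StreamingRecordReader extends RecordReader<LongWritable, Text> {
    private final BlockingQueue<String> incoming = new LinkedBlockingQueue<>();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();
    private long counter = 0;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx) {
        // a background thread would push new records onto `incoming` here
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        try {
            String record = incoming.take();   // blocks until new data is available
            key.set(counter++);
            value.set(record);
            return true;                       // never returning false keeps the map phase open
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;                      // treat interruption as end of stream
        }
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return 0.0f; }
    @Override public void close() { }
}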
I've seen some posts mentioning mapred.reduce.slowstart.completed.maps, but, as I understand it, this only controls when reducers start pulling map output to their local destinations (the shuffle) -- the actual reduce method is still called only after all mappers have completed their execution.
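For reference, the property can be set like any other job parameter (0.0 meaning reducers may start copying map output as soon as the first mapper finishes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowstartExample {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // 0.0 = reducers may begin the shuffle as soon as any map output exists
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.0f);
        Job job = Job.getInstance(conf, "streaming-analysis");
        // Note: this only affects the shuffle; reduce() still waits for all mappers.
        return job;
    }
}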
Of course, there is the option of mimicking continuous execution by processing small batches of data over short time intervals using a continuous stream of separate MR jobs, but this would introduce additional latency, which is not acceptable in my case.
I've also considered using Storm or S4, but before moving any further I need to make sure that this really falls outside the scope of Hadoop.
In summary, it looks like people have been able to develop real-time Hadoop applications (such as Impala) or real-time processing solutions built on top of Hadoop. The question is how?
You are right that the reduce method will never get called if the InputFormat/mappers emit data continuously. The reason is that the reduce method has to iterate over all values for a key, and the full set of values is unknown until the map phase is done, since a value for that key may still arrive from any mapper at any time.
The reduce method accesses values through an iterator, so it is theoretically possible from an API standpoint to call reduce() up front and have it run perpetually, blocking on the iterator until values become available. The reason this feature is absent from Hadoop is that it would require keeping the context for every key in memory, which does not make sense for batch processing of large data sets.
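To make the constraint concrete, here is a generic reducer for illustration: the framework hands reduce() every value emitted for the key as a single Iterable, and it cannot finalize that set while mappers may still be emitting.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The framework passes reduce() the complete set of values for a key.
// That set cannot be built while mappers may still emit more values for the key.
public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {   // iterates over everything any mapper emitted for this key
            sum += v.get();
        }
        context.write(key, new LongWritable(sum));
    }
}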
One way to achieve continuous analysis of a data stream within the Hadoop MapReduce programming model is to submit a continuous stream of small MR jobs, each analyzing a chunk of data. The way to deal with the additional latency in that case is to use one of the many available Hadoop accelerators (disclaimer: I work for ScaleOut Software, a company that provides such an accelerator: ScaleOut hServer, available in a free community edition). ScaleOut hServer is an in-memory MapReduce engine that can run MR jobs in milliseconds. It reuses JVMs between jobs, so job start-up latency is minimal compared to Hadoop. This is a good fit for continuously running MapReduce jobs on chunks of data because it is optimized for real-time performance on data sets that fit in memory.
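A bare-bones sketch of that micro-batch pattern, a driver that keeps submitting a small job per time window, might look like this (the paths, the window length, and the job configuration details are placeholders, not a prescription):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: submit one small MR job per time window over newly arrived data.
public class MicroBatchDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        long windowMillis = 10_000;                     // assumed 10-second windows

        while (true) {
            long windowStart = System.currentTimeMillis();
            Job job = Job.getInstance(conf, "window-" + windowStart);
            job.setJarByClass(MicroBatchDriver.class);
            // configure mapper, reducer, and output types here as in the existing batch job
            FileInputFormat.addInputPath(job, new Path("/in/window-" + windowStart));
            FileOutputFormat.setOutputPath(job, new Path("/out/window-" + windowStart));
            job.waitForCompletion(true);

            long elapsed = System.currentTimeMillis() - windowStart;
            if (elapsed < windowMillis) {
                Thread.sleep(windowMillis - elapsed);   // wait out the rest of the window
            }
        }
    }
}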
One exception to all of the above is when the analysis doesn't require a reduce phase (i.e., the number of reducers is set to 0): if the algorithm can be expressed as map-only, it can be done continuously with a single Hadoop batch job.
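Configuring that is a single call on the Job:

// With zero reducers there is no shuffle and no reduce barrier:
// whatever the mappers emit is written directly by the job's OutputFormat.
job.setNumReduceTasks(0);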