I'm trying to understand the boundaries of Hadoop and map/reduce, and it would help to know a non-trivial problem, or class of problems, that we know map/reduce can't help with.
It would also be interesting to hear about cases where changing one factor of the problem would make it tractable with map/reduce.
Thank you
MapReduce cannot inherently execute recursive or iterative jobs [12]. Its purely batch-oriented behavior is another problem: all of the input must be ready before the job starts, which rules MapReduce out for online and stream-processing use cases.
MapReduce does its work with coarse-grained tasks that are too heavyweight for iterative algorithms. It also has no awareness of the overall pipeline of Map and Reduce steps, so it cannot cache intermediate data in memory for faster performance.
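To make the iteration problem concrete, here is a minimal sketch of how an iterative algorithm usually has to be expressed on plain Hadoop: a driver loop that submits one full MapReduce job per pass, writing all intermediate state to HDFS and reading it back on the next pass. The class name, paths, and fixed iteration count are made up for illustration; a real driver would set algorithm-specific mapper/reducer classes and test for convergence.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int maxIterations = 10; // assumed fixed count; real code would check convergence instead

    for (int i = 0; i < maxIterations; i++) {
      // Every pass is a brand-new job: it reads the previous pass's output from
      // HDFS and writes its own output back to HDFS. Nothing is cached in memory
      // between passes, and each job pays the full scheduling/startup cost.
      Job pass = Job.getInstance(conf, "iteration-" + i);
      pass.setJarByClass(IterativeDriver.class);
      // pass.setMapperClass(...); pass.setReducerClass(...); // algorithm-specific
      FileInputFormat.addInputPath(pass, new Path(args[0] + "/iter-" + i));
      FileOutputFormat.setOutputPath(pass, new Path(args[0] + "/iter-" + (i + 1)));
      if (!pass.waitForCompletion(true)) {
        System.exit(1); // abort the whole computation if any pass fails
      }
    }
  }
}
```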
Anything that involves doing operations on a large set of data, where the problem can be broken down into smaller independent sub-problems whose results can then be aggregated to produce the answer to the larger problem. A trivial example would be calculating the sum of a huge set of numbers.
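For illustration, here is a rough sketch of that sum example as a Hadoop job. The class name and input/output paths are made up; it assumes one number per line of input, and reuses the reducer as a combiner so partial sums are pre-aggregated on each mapper node before the single global reduce.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SumJob {
  // Mapper: each input line holds one number; emit it under a single shared key.
  public static class SumMapper extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String s = line.toString().trim();
      if (!s.isEmpty()) {
        ctx.write(NullWritable.get(), new LongWritable(Long.parseLong(s)));
      }
    }
  }

  // Reducer (also used as combiner): all values arrive under the same key; add them up.
  public static class SumReducer extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
    @Override
    protected void reduce(NullWritable key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable v : values) total += v.get();
      ctx.write(key, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sum");
    job.setJarByClass(SumJob.class);
    job.setMapperClass(SumMapper.class);
    job.setCombinerClass(SumReducer.class); // combine partial sums on each mapper node
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```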
Two things come to mind:
Anything that requires real-time / interactive / low latency response times. There is a fixed cost incurred for any job submitted to Hadoop.
Any problem that is not embarrassingly parallel. Hadoop can handle a lot of problems that require some simple interdependency between data, since records are joined during the reduce phase. However, certain graph processing and machine learning algorithms are difficult to write in Hadoop because they involve too many operations that depend on one another. Some machine learning algorithms require very low-latency random access to a large set of data, which Hadoop does not provide out of the box.
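As a sketch of the kind of "simple interdependency" that does fit the model, here is a hypothetical reduce-side join: mappers tag each record with its source table and key it by the join column, so that all records sharing a key meet in a single reduce call where the join happens. The input format (CSV lines of the form "table,joinKey,payload" with tables "users" and "orders") and all class names are assumptions for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {
  // Mapper: key each record by the join column, keep the table tag with the payload
  // so the reducer can tell the two sides apart.
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",", 3); // table, joinKey, payload
      ctx.write(new Text(parts[1]), new Text(parts[0] + ":" + parts[2]));
    }
  }

  // Reducer: every record with the same join key arrives in this one call,
  // which is where the actual join is performed.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> users = new ArrayList<>();
      List<String> orders = new ArrayList<>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("users:")) users.add(s.substring(6));
        else if (s.startsWith("orders:")) orders.add(s.substring(7));
      }
      for (String u : users)
        for (String o : orders)
          ctx.write(key, new Text(u + "\t" + o));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-side-join");
    job.setJarByClass(ReduceSideJoin.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The limitation described above shows up as soon as the dependencies stop being expressible as a single shuffle-and-join like this: anything that needs many rounds of such joins, or random access to the whole dataset, forces repeated full jobs.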