What is the key difference between Fork/Join and Map/Reduce?
Do they differ in the kind of decomposition and distribution (data vs. computation)?
The Apache Hadoop is an eco-system which provides an environment which is reliable, scalable and ready for distributed computing. MapReduce is a submodule of this project which is a programming model and is used to process huge datasets which sits on HDFS (Hadoop distributed file system).
MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework.
The fork/join framework is an implementation of the ExecutorService interface that helps you take advantage of multiple processors. It is designed for work that can be broken into smaller pieces recursively. The goal is to use all the available processing power to enhance the performance of your application.
In parallel computing, the fork–join model is a way of setting up and executing parallel programs, such that execution branches off in parallel at designated points in the program, to "join" (merge) at a subsequent point and resume sequential execution.
One key difference is that F-J seems to be designed to work on a single Java VM, while M-R is explicitly designed to work on a large cluster of machines. These are very different scenarios.
F-J offers facilities to partition a task into several subtasks, in a recursive-looking fashion; more tiers, possibility of 'inter-fork' communication at this stage, much more traditional programming. Does not extend (at least in the paper) beyond a single machine. Great for taking advantage of your eight-core.
M-R only does one big split, with the mapped splits not talking between each other at all, and then reduces everything together. A single tier, no inter-split communication until reduce, and massively scalable. Great for taking advantage of your share of the cloud.
There is a whole scientific paper on the subject, Comparing Fork/Join and MapReduce.
The paper compares the performance, scalability and programmability of three parallel paradigms: fork/join, MapReduce, and a hybrid approach.
What they find is basically that Java fork/join has low startup latency and scales well for small inputs (<5MB), but it cannot process larger inputs due to the size restrictions of shared-memory, single node architectures. On the other hand, MapReduce has significant startup latency (tens of seconds), but scales well for much larger inputs (>100MB) on a compute cluster.
But there is a lot more to read there if you're up for it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With