I have started reading about Big Data and Hadoop, so this question may sound very stupid to you.
This is what I know.
Each mapper processes a small amount of data and produces an intermediate output. After this, we have the step of shuffle and sort.
Now, Shuffle = Moving intermediate output over to respective Reducers each dealing with a particular key/keys.
So, can one Data Node have the Mapper and Reducer code running in them or we have different DNs for each?
Terminology: Datanodes are for HDFS (storage). Mappers and Reducers (compute) run on nodes that have the TaskTracker daemon on them.
The number of mappers and reducers per tasktracker are controlled by the configs: mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
Subject to other limits in other configs, theoretically, as long as the tasktracker doesn't have the maximum number of map or reduce tasks, it may get assigned more map or reduce tasks by the jobtracker. Typically the jobtracker will try to assign tasks to reduce the amount of data movement.
So, yes, you can have mappers and reducers running on the same node at the same time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With