I am very new to the Hadoop system and in the learning phase.
One thing I noticed in the Shuffle and Sort phase is that a spill happens whenever the MapOutputBuffer reaches 80% full (I think this threshold is also configurable).
Why is the spilling phase required?
Is it because the MapOutputBuffer is a circular buffer, and if we don't empty it, then it may cause data overwrites and memory leaks?
A spill happens when a mapper's output exceeds the amount of memory allocated for it by the MapReduce task: output that no longer fits in the in-memory buffer is written out (spilled) to local disk.
When the contents of the buffer reach a certain threshold size (mapreduce.map.sort.spill.percent, which has the default value 0.80, or 80%), a background thread will start to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete.
mapreduce.task.io.sort.mb (defaults to 100 MB) – the total amount of memory the map output is allowed to occupy. If your data does not fit into this amount, it is spilled to disk.
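If the defaults cause too many spills for a particular job, both values can be tuned per job. A minimal sketch, assuming the Hadoop 2.x property names quoted above and the standard mapreduce Job API (class name is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enlarge the in-memory map-output buffer from the 100 MB default to 256 MB.
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // Start spilling when the buffer is 90% full instead of the default 80%.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        Job job = Job.getInstance(conf, "spill-tuning-example");
        // ... set mapper/reducer classes, input/output paths, and submit as usual.
    }
}
```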
MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on Hadoop commodity servers. In the end, it aggregates the data from all the servers and returns a consolidated output to the application.
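To make that split-then-aggregate flow concrete, here is a minimal sketch of the map and reduce sides of the standard word-count job (class names are illustrative). Each mapper processes one input split in parallel; the framework shuffles and sorts the map output (spilling as described above) and each reducer aggregates the values for the keys assigned to it:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in its input split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                // Each write lands in the MapOutputBuffer discussed above.
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sums the counts for each word collected from all mappers.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```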
I've written a good article that covers this topic: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/
In general: mapreduce.task.io.sort.mb controls the size of the in-memory map-output buffer (default 100 MB), and mapreduce.map.sort.spill.percent controls the fill fraction at which spilling begins (default 0.80).
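As a worked example of how the two values combine, this sketch (hypothetical class name) reads them back from a Configuration and computes the byte threshold at which a background spill begins, which with the defaults is 100 MB × 0.80 = 80 MB:

```java
import org.apache.hadoop.conf.Configuration;

public class SpillThreshold {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Fall back to the documented defaults if the properties are unset.
        int sortMb = conf.getInt("mapreduce.task.io.sort.mb", 100);
        float spillPercent = conf.getFloat("mapreduce.map.sort.spill.percent", 0.80f);
        long thresholdBytes = (long) (sortMb * 1024L * 1024L * spillPercent);
        // With the defaults: 100 MB * 0.80 = 80 MB before spilling starts.
        System.out.printf("Spill starts at ~%d MB of the %d MB buffer%n",
                thresholdBytes / (1024 * 1024), sortMb);
    }
}
```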