
Why does Hadoop spilling happen?

I am very new to Hadoop and still in the learning phase.

One thing I noticed in the shuffle and sort phase is that a spill happens whenever the MapOutputBuffer reaches 80% full (I think this threshold is configurable).

Why is the spilling phase required?

Is it because MapOutputBuffer is a circular buffer, and if we don't empty it, then data may be overwritten or memory may leak?

TalentTuner asked Jan 11 '15 19:01



1 Answer

I've written a good article that covers this topic: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/

In general:

  • Spilling happens when there is not enough memory to hold all of the mapper output. The amount of memory available for this buffer is set by mapreduce.task.io.sort.mb.
  • It starts when 80% of the buffer space is occupied (the threshold is configurable via mapreduce.map.sort.spill.percent) because spilling is done in a separate thread, so that it does not interfere with the mapper. If the buffer reaches 100% utilization, the mapper thread has to stop and wait for the spilling thread to free up space. The 80% threshold is chosen to avoid this.
  • Spilling happens at least once, when the mapper finishes, because the mapper output must be sorted and saved to disk for the reducer processes to read. There is no point in writing a separate function for this final "save to disk", because in general it does the same job as a spill.
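
The three points above can be illustrated with a toy model. This is only a sketch, not Hadoop code: the function simulate_spills, its parameters, and the byte counts are hypothetical names and values chosen for illustration. It also spills synchronously, whereas Hadoop spills in a background thread while the mapper keeps writing, and it ignores sorting and per-record metadata.

```python
def simulate_spills(record_sizes, buffer_bytes, spill_percent=0.80):
    """Toy model of map-side spilling (hypothetical helper, not a Hadoop API).

    record_sizes  -- sizes in bytes of the records emitted by the mapper
    buffer_bytes  -- buffer capacity (plays the role of mapreduce.task.io.sort.mb)
    spill_percent -- fill fraction that triggers a spill (0.80 by default)

    Returns a list with the number of bytes written by each spill.
    """
    threshold = buffer_bytes * spill_percent
    used = 0
    spills = []

    for size in record_sizes:
        used += size
        # Once the fill level crosses the threshold, the buffered data is
        # written out and the space is freed. Real Hadoop does this in a
        # separate thread; this sketch does it inline for simplicity.
        if used >= threshold:
            spills.append(used)
            used = 0

    # Final flush when the mapper finishes: whatever is left in the buffer
    # goes through the same "spill" path, so there is always at least one.
    if used > 0 or not spills:
        spills.append(used)

    return spills


if __name__ == "__main__":
    # Seven 30-byte records into a 100-byte buffer with an 80% threshold.
    print(simulate_spills([30] * 7, buffer_bytes=100))  # prints [90, 90, 30]
    # Output that fits in the buffer still produces one final spill.
    print(simulate_spills([30], buffer_bytes=100))      # prints [30]
```

In this model, a larger buffer_bytes means fewer spills for the same output, which is the usual reason for raising mapreduce.task.io.sort.mb when a job spills many times per map task.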
0x0FFF answered Nov 09 '22 09:11
