I am very new to Hadoop and still learning. One thing I noticed in the shuffle and sort phase is that a spill happens whenever the MapOutputBuffer reaches 80% full (I think this threshold is configurable). Why is the spill phase required? Is it because MapOutputBuffer is a circular buffer, so that if we don't empty it, data may be overwritten and memory may leak?


Why does Hadoop spilling happen?

1 Answer

I've written a good article that covers this topic: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/

In general:

Spilling happens when there is not enough memory to hold all of the mapper output. The amount of memory available for this buffer is set by mapreduce.task.io.sort.mb.
It starts when 80% of the buffer space is occupied because spilling is done in a separate thread, so that it does not interfere with the mapper. If the buffer reaches 100% utilization, the mapper thread has to stop and wait for the spilling thread to free up space. The 80% threshold is chosen to avoid this.
Spilling happens at least once, when the mapper finishes, because the mapper output has to be sorted and saved to disk for the reducer processes to read. There is no point in writing a separate function for this last "save to disk", because in general it does the same task.
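
The points above can be illustrated with a rough sketch. This is not Hadoop's actual MapOutputBuffer code: the class SpillSketch and the method countSpills are hypothetical names, and the model is single-threaded, so a spill empties the buffer instantly instead of running in a background thread while the mapper keeps writing. It only shows how the buffer size and the threshold decide how many spills occur. In a real job the buffer size comes from mapreduce.task.io.sort.mb and the threshold from mapreduce.map.sort.spill.percent (0.80 by default).

```java
// Simplified, single-threaded model of mapper-side spilling.
// All names here are illustrative; none of this is Hadoop's real implementation.
class SpillSketch {

    /**
     * Counts how many spills occur when a mapper emits recordCount records of
     * recordBytes each into a buffer of bufferBytes, spilling whenever the
     * occupied space reaches spillPercent of the buffer.
     */
    static int countSpills(long bufferBytes, double spillPercent,
                           int recordCount, long recordBytes) {
        long threshold = (long) (bufferBytes * spillPercent);
        long occupied = 0;
        int spills = 0;

        for (int i = 0; i < recordCount; i++) {
            occupied += recordBytes;
            if (occupied >= threshold) {
                // Threshold reached: sort the buffered records and write them
                // to disk. Modeled here as an instant operation that empties
                // the buffer.
                spills++;
                occupied = 0;
            }
        }

        // Final flush when the mapper finishes: whatever is left is written
        // out, and at least one spill is always counted.
        if (occupied > 0 || spills == 0) {
            spills++;
        }
        return spills;
    }

    public static void main(String[] args) {
        // Output fits in the buffer: only the final flush happens.
        System.out.println(countSpills(100, 0.8, 3, 10));   // prints 1
        // Threshold is hit once, then the leftover is flushed at the end.
        System.out.println(countSpills(100, 0.8, 10, 10));  // prints 2
    }
}
```

In this model, raising the buffer size or the threshold reduces the number of spills for the same output, which is the usual reason for tuning these two settings.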

189

answered Nov 09 '22 09:11

0x0FFF


Tags:

hadoop

mapreduce

TalentTuner
