Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Intermediate Data Spill in Mapreduce (Buffer Memory)

I run a word count job in Hadoop 2.6.0 and I see that there are several spills for the map output.

I have the following configuration:

mapreduce.task.io.sort.mb = 100
mapreduce.map.sort.spill.percent = 0.80

After running the job, the Map output bytes = 222660096. By looking at the container log below, it seems that the buffer size is almost 31055173 bytes not 100MB.

Look at the parameters values before spilling the first spill into the disk bufstart = 0; bufend = 31055173; bufvoid = 104857600

And by dividing the Map output in bytes by bufend (222660096/31055173 = 7.17 (= 8 spills)).

Why is this happening? The buffer size should be 80MB (soft limit), so I think I should have 3 spills only. Also, can anyone give a better explanation for bufstart, bufend and bufvoid?

2016-03-04 20:38:16,678 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://hmaster:9000/input/Input-128M:0+134217728
2016-03-04 20:38:16,842 INFO [main] org.apache.hadoop.mapred.MapTask (EQUATOR) 0 kvi 26214396(104857584)
2016-03-04 20:38:16,842 INFO [main] org.apache.hadoop.mapred.MapTask: mapreduce.task.io.sort.mb: 100
2016-03-04 20:38:16,842 INFO [main] org.apache.hadoop.mapred.MapTask: soft limit at 83886080
2016-03-04 20:38:16,842 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufvoid = 104857600
2016-03-04 20:38:16,842 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396; length = 6553600
2016-03-04 20:38:16,854 INFO [main] org.apache.hadoop.mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2016-03-04 20:38:19,931 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2016-03-04 20:38:19,931 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 31055173; bufvoid = 104857600
2016-03-04 20:38:19,931 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 26214396(104857584); kvend = 13006672(52026688); length = 13207725/6553600
2016-03-04 20:38:19,931 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 41540922 kvi 10385224(41540896)
2016-03-04 20:38:28,178 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 0
2016-03-04 20:38:28,179 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 41540922 kv 10385224(41540896) kvi 7763800(31055200)
2016-03-04 20:38:30,020 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2016-03-04 20:38:30,020 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 41540922; bufend = 72594952; bufvoid = 104857600
2016-03-04 20:38:30,020 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 10385224(41540896); kvend = 23391620(93566480); length = 13208005/6553600
2016-03-04 20:38:30,020 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 83080708 kvi 20770172(83080688)
2016-03-04 20:38:37,042 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 1
2016-03-04 20:38:37,042 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 83080708 kv 20770172(83080688) kvi 18148744(72594976)
2016-03-04 20:38:39,045 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2016-03-04 20:38:39,045 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 83080708; bufend = 9277592; bufvoid = 104857598
2016-03-04 20:38:39,045 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 20770172(83080688); kvend = 7562280(30249120); length = 13207893/6553600
2016-03-04 20:38:39,045 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 19763348 kvi 4940832(19763328)
2016-03-04 20:38:46,376 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 2
2016-03-04 20:38:46,376 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 19763348 kv 4940832(19763328) kvi 2319404(9277616)
2016-03-04 20:38:48,320 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2016-03-04 20:38:48,320 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 19763348; bufend = 50820055; bufvoid = 104857600
2016-03-04 20:38:48,320 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 4940832(19763328); kvend = 17947892(71791568); length = 13207341/6553600
2016-03-04 20:38:48,320 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 61305803 kvi 15326444(61305776)
2016-03-04 20:38:55,301 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 3
2016-03-04 20:38:55,301 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 61305803 kv 15326444(61305776) kvi 12705020(50820080)
2016-03-04 20:38:57,209 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2016-03-04 20:38:57,209 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 61305803; bufend = 92359884; bufvoid = 104857600
2016-03-04 20:38:57,209 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 15326444(61305776); kvend = 2118452(8473808); length = 13207993/6553600
2016-03-04 20:38:57,209 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 102845638 kvi 25711404(102845616)
2016-03-04 20:39:04,710 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 4
2016-03-04 20:39:04,710 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 102845638 kv 25711404(102845616) kvi 23089976(92359904)
2016-03-04 20:39:06,608 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2016-03-04 20:39:06,608 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 102845638; bufend = 29043356; bufvoid = 104857592
2016-03-04 20:39:06,608 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 25711404(102845616); kvend = 12503720(50014880); length = 13207685/6553600
2016-03-04 20:39:06,608 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 39529110 kvi 9882272(39529088)
2016-03-04 20:39:13,669 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 5
2016-03-04 20:39:13,669 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 39529110 kv 9882272(39529088) kvi 7260844(29043376)
2016-03-04 20:39:15,551 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2016-03-04 20:39:15,551 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 39529110; bufend = 70581562; bufvoid = 104857600
2016-03-04 20:39:15,551 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 9882272(39529088); kvend = 22888272(91553088); length = 13208401/6553600
2016-03-04 20:39:15,551 INFO [main] org.apache.hadoop.mapred.MapTask: (EQUATOR) 81067317 kvi 20266824(81067296)
2016-03-04 20:39:16,351 INFO [main] org.apache.hadoop.mapred.MapTask: Starting flush of map output
2016-03-04 20:39:22,600 INFO [SpillThread] org.apache.hadoop.mapred.MapTask: Finished spill 6
2016-03-04 20:39:22,600 INFO [main] org.apache.hadoop.mapred.MapTask: (RESET) equator 81067317 kv 20266824(81067296) kvi 18022368(72089472)
2016-03-04 20:39:22,600 INFO [main] org.apache.hadoop.mapred.MapTask: Spilling map output
2016-03-04 20:39:22,600 INFO [main] org.apache.hadoop.mapred.MapTask: bufstart = 81067317; bufend = 86345178; bufvoid = 104857600
2016-03-04 20:39:22,600 INFO [main] org.apache.hadoop.mapred.MapTask: kvstart = 20266824(81067296); kvend = 18022372(72089488); length = 2244453/6553600
2016-03-04 20:39:23,552 INFO [main] org.apache.hadoop.mapred.MapTask: Finished spill 7
2016-03-04 20:39:23,565 INFO [main] org.apache.hadoop.mapred.Merger: Merging 8 sorted segments
2016-03-04 20:39:23,577 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 8 segments left of total size: 1829496 bytes
2016-03-04 20:39:24,096 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1457141718249_0001_m_000000_0 is done. And is in the process of committing
2016-03-04 20:39:24,222 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1457141718249_0001_m_000000_0' done.
like image 625
Self Avatar asked Oct 19 '22 13:10

Self


1 Answers

bufstart is the starting index of the buffer that is about to be spilled

buffend is the last index of the buffer that would be spilled.

buff void indicates the end of the buffer, till which spill can be done and no further.

You can read more about the usage of indices in the code here.

In addition, mapper uses the sort buffer to sort the keys that it processed and writes it to a intermediate file, that would be fed to the reducer/combiner. As per the behaviour that you state, I think this is happening because you have 7/8 reducers feeding of the mapper input, resulting in the same number of spills. Can you check the reducer logs to confirm this?

like image 58
rahulbmv Avatar answered Oct 27 '22 16:10

rahulbmv