
How could I tell if my hadoop config parameter io.sort.factor is too small or too big?

Tags:

hadoop

After reading http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html, we concluded that our 6-node Hadoop cluster could use some tuning, and io.sort.factor seems to be a good candidate, as it controls an important tradeoff. We plan to tweak and test, but it seems reasonable to plan ahead and know what to expect and what to watch for.

It's currently set to 10. How would we know that it's causing too many merges? When we raise it, how would we know that it's causing too many files to be opened?

Note that we can't follow the blog's log extracts directly, as the blog is based on CDH3b2, we're working on CDH3u2, and the logs have changed...

ihadanny asked Dec 27 '11 08:12


1 Answer

There are a few tradeoffs to consider.

  1. The number of seeks performed when merging files. If you raise the merge factor too far, the seek cost on disk will exceed the savings from doing a parallel merge (note that the OS cache might mitigate this somewhat).

  2. Increasing the sort factor decreases the amount of data in each partition. I believe the number is io.sort.mb / io.sort.factor for each partition of sorted data. I believe the general rule of thumb is to have io.sort.mb = 10 * io.sort.factor (this is based on the disk's seek latency relative to its transfer speed, I believe; I'm sure this could be tuned better if it were your bottleneck). If you keep these in line with each other, then the seek overhead from merging should be minimized.

  3. If you increase io.sort.mb, then you increase memory pressure on the cluster, leaving less memory available for job tasks. Memory usage for sorting is mapper tasks * io.sort.mb, so you could find yourself causing extra GCs if this is too high.
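
The arithmetic behind these three points can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the partition size, the 10x rule of thumb, and the memory formula are taken from the points above as stated, while merge_passes is a textbook k-way merge estimate and not Hadoop's exact merge logic. All function names and the example values are hypothetical; nothing here is a Hadoop API.

```python
def merge_passes(num_segments, io_sort_factor):
    """Textbook k-way merge estimate: passes needed to reduce
    num_segments sorted files to one when merging io_sort_factor at a time.
    Simplified assumption; Hadoop's real merge scheduling differs in detail."""
    passes = 0
    while num_segments > 1:
        # Ceiling division: each pass merges groups of up to io_sort_factor files.
        num_segments = -(-num_segments // io_sort_factor)
        passes += 1
    return passes


def partition_mb(io_sort_mb, io_sort_factor):
    """Approximate MB per sorted partition: io.sort.mb / io.sort.factor."""
    return io_sort_mb / io_sort_factor


def rule_of_thumb_sort_mb(io_sort_factor):
    """io.sort.mb suggested by the rule of thumb: 10 * io.sort.factor."""
    return 10 * io_sort_factor


def total_sort_buffer_mb(mapper_tasks, io_sort_mb):
    """Sort-buffer memory for the map tasks counted: mapper tasks * io.sort.mb."""
    return mapper_tasks * io_sort_mb


if __name__ == "__main__":
    factor = 10    # the asker's current io.sort.factor
    mappers = 8    # assumed number of concurrent map tasks, purely illustrative
    spills = 100   # assumed number of files to merge, purely illustrative
    sort_mb = rule_of_thumb_sort_mb(factor)
    print("io.sort.mb by rule of thumb:", sort_mb)
    print("MB per partition:", partition_mb(sort_mb, factor))
    print("sort buffer memory (MB):", total_sort_buffer_mb(mappers, sort_mb))
    print("estimated merge passes:", merge_passes(spills, factor))
```

Running the same functions with a larger factor shows the tradeoff: fewer estimated merge passes, but smaller partitions (more seeks) unless io.sort.mb is raised too, which in turn raises the memory total.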

Essentially,

If you find yourself swapping heavily, then there's a good chance you have set the sort factor too high.

If the ratio between io.sort.mb and io.sort.factor isn't correct, then you may need to change io.sort.mb (if you have the memory) or lower the sort factor.

If you find that you are spending more time in your mappers than in your reducers, then you may want to increase the number of map tasks and decrease the sort factor (assuming there is memory pressure).
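
When comparing test runs, the three checks above could be written down as a small helper. This is a rough sketch only: the function name and its inputs are made up for illustration, and how you detect swapping, memory pressure, or time spent in mappers and reducers is left to whatever monitoring you already have.

```python
def tuning_hints(io_sort_mb, io_sort_factor, swapping_heavily,
                 map_seconds, reduce_seconds, memory_pressure):
    """Return the hints from the checklist above that apply to one test run.
    All inputs are values you measured or configured yourself."""
    hints = []
    if swapping_heavily:
        hints.append("heavy swapping: the sort factor may be too high")
    # Rule of thumb from the answer: io.sort.mb = 10 * io.sort.factor.
    if io_sort_mb != 10 * io_sort_factor:
        hints.append("ratio off: raise io.sort.mb (if memory allows) "
                     "or lower the sort factor")
    if map_seconds > reduce_seconds and memory_pressure:
        hints.append("mappers dominate: add map tasks and lower the sort factor")
    return hints


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from a real cluster.
    for hint in tuning_hints(200, 10, True, 90, 30, True):
        print(hint)
```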

D W answered Sep 26 '22 14:09