Garbage collection tuning in Spark: how to estimate size of Eden?

I am reading about garbage collection tuning in Spark: The Definitive Guide by Bill Chambers and Matei Zaharia. This chapter is largely based on Spark's documentation. Nevertheless, the authors extend the documentation with an example of how to deal with too many minor collections but not many major collections.

Both the official documentation and the book state that:

If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden is determined to be E, then you can set the size of the Young generation using the option -Xmn=4/3*E. (The scaling up by 4/3 is to account for space used by survivor regions as well.) (See here)
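For concreteness, here is a minimal sketch of where such a flag could end up, assuming a placeholder value of 2 GB for 4/3*E (the real value depends on the estimate of E discussed below). Executor JVM flags are normally passed through spark.executor.extraJavaOptions, and the docs' "-Xmn=4/3*E" is shorthand: the actual HotSpot flag takes a concrete size such as -Xmn2g.

    import org.apache.spark.SparkConf

    // Placeholder value: suppose E has been estimated and 4/3 * E rounds up to 2 GB.
    val conf = new SparkConf()
      .setAppName("gc-tuning-example")
      .set("spark.executor.extraJavaOptions", "-Xmn2g")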

The book offers an example (Spark: The Definitive Guide, first ed., p. 324):

If your task is reading data from HDFS, the amount of memory used by the task can be estimated by using the size of the data block read from HDFS. Note that the size of a decompressed block is often two or three times the size of the block. So if you want to have three or four tasks' worth of working space, and the HDFS block size is 128 MB, we can estimate size of Eden to be 43,128 MB.

Even assuming that each decompressed block takes as much as 512 MB, that we have 4 tasks, and that we scale up by 4/3, I don't see how anyone could arrive at an estimate of 43,128 MB of memory for Eden.

I would rather say that roughly 3 GB should be enough for Eden, given the book's assumptions.
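For reference, here is the quick calculation behind that figure, using only the assumptions above (4 tasks, 512 MB per decompressed block, 4/3 scaling for survivor space):

    // Back-of-the-envelope numbers for the assumptions stated above.
    val tasks = 4
    val uncompressedBlockMB = 512
    val edenMB = tasks * uncompressedBlockMB    // 4 * 512 = 2048 MB of working space
    val youngGenMB = edenMB * 4.0 / 3.0         // ~2731 MB once survivor space is included

That is on the order of 2-3 GB, nowhere near 43 GB.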

Could anyone explain how this estimation should be calculated?

asked Jan 24 '26 21:01 by Wojciech Walczak

1 Answer

OK, I think the new Spark docs make it clear:

As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using the size of the data block read from HDFS. Note that the size of a decompressed block is often 2 or 3 times the size of the block. So if we wish to have 3 or 4 tasks’ worth of working space, and the HDFS block size is 128 MB, we can estimate size of Eden to be 4*3*128MB.

So the intended figure is 4*3*128 MB, not the 43,128 MB printed in the book; the multiplication signs were apparently lost in typesetting.
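Working through that arithmetic (a quick sketch using the docs' own numbers):

    // The docs' example: 4 tasks' worth of working space, 128 MB HDFS blocks,
    // decompressed blocks roughly 3x the on-disk block size.
    val tasks = 4
    val decompressionFactor = 3
    val hdfsBlockMB = 128
    val edenMB = tasks * decompressionFactor * hdfsBlockMB   // 4 * 3 * 128 = 1536 MB
    val youngGenMB = edenMB * 4 / 3                          // -Xmn ≈ 2048 MB with survivor space

So Eden comes out to roughly 1.5 GB, and the young generation (-Xmn) to about 2 GB, which is in the same ballpark as the ~3 GB estimate in the question.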

answered Jan 26 '26 15:01 by Wojciech Walczak