Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Iterator behaviour in flink reduceGroup

I am creating a system that should handle huge amount of data and I need to understand how the reduce group operator works

I have a dataset where I apply a groupby and subsequently a reduceGroup How does the iterator that is passed to the reduceGroup function behave? is it a lazy iterator that loads data when they are requested or an eager one that prepares all the data in memory when it is created?

i am using the scala api in flink 0.9 milestone1

like image 579
il.bert Avatar asked May 11 '15 14:05

il.bert


1 Answers

Flink performs the group-by for a groupReduce using a sort operator. The sort operator receives a certain memory budget for sorting. As long as the data fits into this budget, the sort will happen be in-memory. Otherwise, the sort becomes an external merge-sort and spills to disk. Flink reads the sorted data stream and applies the groupReduce function "on-the-fly". The data of a group is not completely read in-memory before the function is applied. Hence, you can process very large groups if the user-function does not materialize group records itself.

like image 191
Fabian Hueske Avatar answered Oct 08 '22 21:10

Fabian Hueske