I am creating a system that should handle huge amount of data and I need to understand how the reduce group operator works
I have a dataset where I apply a groupby and subsequently a reduceGroup How does the iterator that is passed to the reduceGroup function behave? is it a lazy iterator that loads data when they are requested or an eager one that prepares all the data in memory when it is created?
i am using the scala api in flink 0.9 milestone1
Flink performs the group-by for a groupReduce using a sort operator. The sort operator receives a certain memory budget for sorting. As long as the data fits into this budget, the sort will happen be in-memory. Otherwise, the sort becomes an external merge-sort and spills to disk. Flink reads the sorted data stream and applies the groupReduce function "on-the-fly". The data of a group is not completely read in-memory before the function is applied. Hence, you can process very large groups if the user-function does not materialize group records itself.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With