Could anyone compare Flink and Spark as platforms for machine learning? Which is potentially better for iterative algorithms? For the general Flink vs. Spark discussion, see "What is the difference between Apache Spark and Apache Flink?"
Disclaimer: I'm a PMC member of Apache Flink. My answer focuses on the differences in how Flink and Spark execute iterations.
Apache Spark executes iterations by loop unrolling. This means that for each iteration a new set of tasks/operators is scheduled and executed. Spark does this very efficiently because it is very good at low-latency task scheduling (the same mechanism is used for Spark Streaming, by the way) and because it caches data in memory across iterations. As a result, each iteration operates on the result of the previous iteration, which is held in memory. In Spark, iterations are implemented as regular for-loops in the driver program (see the Logistic Regression example).
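To make the loop-unrolling structure concrete, here is a minimal sketch in plain Python. It does not use the Spark API; it only mimics the shape of a driver-side for-loop in which every pass runs a fresh "map" and "reduce" over the same in-memory data. The data set, the `gradient` and `train` helpers, and the step size are all assumptions made up for illustration.

```python
import math

# Toy data set: (features, label) pairs with labels in {0, 1}.
# In Spark this would be a cached, partitioned data set; here it is a plain list.
points = [([1.0, 2.0], 1), ([2.0, 1.0], 1), ([-1.0, -2.0], 0), ([-2.0, -1.0], 0)]


def gradient(w, x, y):
    """Logistic-loss gradient contribution of a single point."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]


def train(points, iterations=20, step=0.5):
    w = [0.0, 0.0]
    # The loop lives in the driver program. Each pass stands for a newly
    # scheduled map + reduce job over the same in-memory data.
    for _ in range(iterations):
        grads = [gradient(w, x, y) for x, y in points]   # "map" step
        total = [sum(col) for col in zip(*grads)]        # "reduce" step
        w = [wi - step * gi for wi, gi in zip(w, total)]
    return w


weights = train(points)
```

The point of the sketch is only the control flow: the loop itself is ordinary host-language code, and the work inside it is re-issued on every iteration.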
Flink executes programs with iterations as cyclic data flows. This means that a data flow program (and all its operators) is scheduled just once, and the data is fed back from the tail of an iteration to its head. Basically, data flows in cycles around the operators within an iteration. Since operators are scheduled only once, they can maintain state across all iterations. Flink's API offers two dedicated iteration operators to specify iterations: 1) bulk iterations, which are conceptually similar to loop unrolling, and 2) delta iterations. Delta iterations can significantly speed up certain algorithms because the work in each iteration decreases as the number of iterations goes on. For example, the 10th iteration of a delta-iteration PageRank implementation completes much faster than the first iteration.
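The shrinking-work idea behind delta iterations can be sketched in plain Python. This is not Flink code and not PageRank; it swaps in connected components by minimum-label propagation because that is short enough to follow. The names `solution` (state kept across iterations) and `workset` (elements that changed in the last iteration) are illustrative, and the function `connected_components` is a hypothetical helper.

```python
def connected_components(vertices, edges):
    # Build an undirected adjacency map.
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # State kept across iterations: the current component label per vertex.
    solution = {v: v for v in vertices}
    # Only vertices whose label changed in the previous iteration do work.
    workset = set(vertices)
    workset_sizes = []

    while workset:
        workset_sizes.append(len(workset))
        # Changed vertices offer their label to their neighbors; a neighbor
        # keeps the smallest offer if it improves on its current label.
        candidates = {}
        for v in workset:
            for n in neighbors[v]:
                if solution[v] < candidates.get(n, solution[n]):
                    candidates[n] = solution[v]
        # Apply the updates; the updated vertices form the next workset.
        solution.update(candidates)
        workset = set(candidates)

    return solution, workset_sizes


solution, sizes = connected_components(
    [1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (3, 4), (5, 6)]
)
print(sizes)  # [6, 4, 2, 1]
```

On this small graph the workset shrinks from 6 elements to 1 before the loop stops, which is the effect described above: later iterations touch only the part of the data that is still changing.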