I have built a k-means application for both Spark and Flink. My test case is clustering 1 million points on a 3-node cluster.
When memory becomes the bottleneck, Flink starts spilling to disk and runs slowly, but it runs. Spark, however, loses executors when memory fills up and starts over (an infinite loop?).
I tried to customize the memory settings with help from the mailing list here, thanks. But Spark still does not work.
Are there any configurations that need to be set? I mean, Flink works with little memory, so Spark should be able to as well, shouldn't it?
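For illustration, here is a minimal sketch of the kind of memory settings I have been experimenting with (the values are placeholders, not a recommendation; note that spark.memory.fraction exists only in Spark 1.6+, while older releases use spark.storage.memoryFraction instead):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: placeholder values, not a tuned configuration.
val conf = new SparkConf()
  .setAppName("kmeans")
  // Heap size requested for each executor.
  .set("spark.executor.memory", "2g")
  // Fraction of the heap used for Spark's execution and storage
  // (Spark 1.6+ key; older versions use spark.storage.memoryFraction).
  .set("spark.memory.fraction", "0.6")

val sc = new SparkContext(conf)
```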
The main reason for this is Flink's stream-processing model, which processes records one at a time in real time, something Apache Spark's (micro-)batch model cannot do. This is what makes Flink faster than Spark for such workloads.
Flink's low latency consistently outperforms Spark, even at higher throughput. Spark can achieve low latency at lower throughput, but increasing the throughput also increases the latency.
This difference is unlikely to matter in practice unless the use case demands low latency (e.g., financial systems), where delays on the order of milliseconds can have a significant impact. That said, Flink is still very much a work in progress and cannot yet claim to replace Spark.
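As a rough illustration of the record-at-a-time model, here is a minimal Flink DataStream sketch in Scala (the socket source, host, and port are just assumptions for the example):

```scala
import org.apache.flink.streaming.api.scala._

object StreamSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Each incoming line is processed as soon as it arrives,
    // rather than being collected into a batch first.
    env.socketTextStream("localhost", 9999)
      .map(_.toUpperCase)
      .print()

    env.execute("record-at-a-time sketch")
  }
}
```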
Flink has become one of the most popular computing engines in the streaming field. It was originally designed as a big data engine unifying batch and stream computing. Efforts toward this design goal actually started in 2018. To implement it, Alibaba established a new and unified API architecture and solution.
I am not a Spark expert (I am a Flink contributor). As far as I know, Spark is not able to spill to disk if there is not enough main memory; this is one advantage of Flink over Spark. However, Spark has announced a new project called "Tungsten" to enable managed memory similar to Flink's. I don't know whether this feature is already available: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
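One partial workaround on the Spark side, as far as I know: explicitly cached RDDs can be given a storage level that allows spilling to disk. A minimal sketch follows (the input path is hypothetical); note this only affects cached data, not the memory Spark uses internally for shuffles and aggregations:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("kmeans"))

// Hypothetical input path; replace with the real dataset.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))

// MEMORY_AND_DISK lets cached partitions that don't fit in RAM
// be written to disk instead of evicted and recomputed.
points.persist(StorageLevel.MEMORY_AND_DISK)
```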
There are a couple of SO questions about Spark out-of-memory problems (an Internet search for "spark out of memory" yields many results, too):

- spark java.lang.OutOfMemoryError: Java heap space
- Spark runs out of memory when grouping by key
- Spark out of memory

Maybe one of those helps.