I have worked with Storm and Spark, but Samza is quite new to me.
I do not understand why Samza was introduced when Storm already exists for real-time processing. Spark provides in-memory, near-real-time processing and has other very useful components such as GraphX and MLlib.
What improvements does Samza bring, and what further improvements are possible?
Spark can be a great choice if a big data application needs to run Hadoop MapReduce-style jobs faster. Storm focuses on complex event processing, providing a fault-tolerant way to pipeline different computations on events as they flow into the system.
Apache Storm and Spark are both big data processing platforms that can work with real-time data streams. The core difference between the two is in how they handle processing: Storm parallelizes tasks (different computations run concurrently as events pass through them), while Spark parallelizes data (the same computation runs concurrently over partitions of a dataset).
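To make the task-versus-data distinction concrete, here is a rough sketch in plain Python. It uses neither Storm nor Spark; the function names `data_parallel_sum_of_squares` and `task_parallel_sum_of_squares` are made up for illustration, and threads stand in for cluster workers. Both functions compute the same result, they just split the work differently.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def data_parallel_sum_of_squares(numbers, partitions):
    """Data parallelism: the SAME function runs on different slices of the data."""
    # Split the dataset into `partitions` interleaved chunks.
    chunks = [numbers[i::partitions] for i in range(partitions)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        # Each worker applies the identical computation to its own chunk.
        partials = list(pool.map(lambda chunk: sum(x * x for x in chunk), chunks))
    return sum(partials)


def task_parallel_sum_of_squares(numbers):
    """Task parallelism: DIFFERENT stages run concurrently, linked by a queue."""
    squared = queue.Queue()
    done = object()  # sentinel marking the end of the stream
    result = []

    def square_stage():
        # Stage 1: emit one squared value per incoming event.
        for x in numbers:
            squared.put(x * x)
        squared.put(done)

    def sum_stage():
        # Stage 2: consume events as they arrive and keep a running total.
        total = 0
        while True:
            item = squared.get()
            if item is done:
                break
            total += item
        result.append(total)

    stages = [threading.Thread(target=square_stage), threading.Thread(target=sum_stage)]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return result[0]
```

The second function is loosely the shape of a Storm topology (one stage feeding the next), while the first is loosely the shape of a Spark job over a partitioned dataset. Real deployments of either system mix both kinds of parallelism, so treat this as a mental model only.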
We currently use Storm as our Twitter realtime data processing pipeline. We have Storm topologies for content filtering, geolocalisation and classification.
Apache Spark and Flink are both next-generation big data tools that are attracting industry attention. Both provide native connectivity with Hadoop and NoSQL databases and can process HDFS data, and both are good solutions to a range of big data problems. Flink is often reported to be faster than Spark for streaming workloads because its architecture processes records as a true stream rather than in micro-batches.
This is a good summary of the differences and pros and cons.
I would just add that Samza, which actually isn't that new, brings a certain simplicity since it is opinionated about using Kafka as its backend, while others try to be more generic at the cost of simplicity. Samza was pioneered by the same people who created Kafka, who are also the same people behind the Kappa Architecture, primarily Jay Kreps, formerly of LinkedIn. That's pretty cool.
Also, the programming models are totally different between realtime streams with Samza, microbatches in Spark Streaming (which isn't exactly the same as Spark), and spouts and bolts with tuples in Storm.
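As a rough illustration of how differently per-message processing and micro-batching feel, here is a plain-Python sketch. It uses none of the real Samza, Spark Streaming, or Storm APIs; `process_per_message` and `process_micro_batches` are hypothetical names, and batching by count is an assumption made for simplicity (Spark Streaming actually groups records by time interval).

```python
def process_per_message(events, handler):
    """Per-message style: the handler sees one event at a time, as it arrives."""
    for event in events:
        yield handler(event)


def process_micro_batches(events, batch_size, batch_handler):
    """Micro-batch style: events are buffered and handed over in small groups."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch_handler(batch)
            batch = []
    if batch:
        # Flush whatever is left when the stream ends.
        yield batch_handler(batch)


words = ["storm", "spark", "samza", "flink", "heron"]

# One output per input event.
per_message = list(process_per_message(words, str.upper))

# One output per batch of up to two events.
micro_batched = list(
    process_micro_batches(words, 2, lambda batch: [w.upper() for w in batch])
)
```

In the per-message version the unit of work is a single event, which is closer to how you write a Samza task or a Storm bolt. In the micro-batch version the unit of work is a small collection, which is closer to how Spark Streaming code reads. Latency and API shape both follow from that choice.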
None of these are "better." It all depends on your use cases, the strengths of your team, how the APIs match up with your mental models, quality of support, etc.
You also forgot Apache Flink and Twitter's Heron, which Twitter built because Storm started to fail them. Then again, very few need to operate at the scale of Twitter.