I've stumbled upon this article that purports to contrast Samza with Storm, but it seems to address only implementation details.
Where do these two distributed computation engines differ in their use cases? What kind of job is each tool good for?
At Metamarkets, Apache Storm is used to process real-time event data streamed from Apache Kafka message brokers, and then to load that data into a Druid cluster, the low-latency data store at the heart of our real-time analytics service.
Kafka uses ZooKeeper to share and save state between brokers, so Kafka is basically responsible for transferring messages from one machine to another. Storm is a scalable, fault-tolerant, real-time analytics system (think of it as Hadoop for real time). It consumes data from sources (Spouts) and passes it through a pipeline (Bolts).
Samza provides fault tolerance, isolation, and stateful processing. Unlike batch systems such as Apache Hadoop or Apache Spark, it provides continuous computation and output, which results in sub-second response times.
Apache Storm and Spark are platforms for big data processing that work with real-time data streams. The core difference between the two technologies is in the way they handle data processing: Storm parallelizes task computation, while Spark parallelizes data computation.
Apache Samza is based on the concept of a Publish/Subscribe Task that listens to a data stream, processes messages as they arrive and outputs its result to another stream. A stream can be broken into multiple partitions and a copy of the task will be spawned for each partition.
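As a rough illustration of that model (plain Python, not the Samza API; all names here are made up for the sketch), one task copy per partition could look like:

```python
# Hypothetical sketch: one task instance per partition, each consuming
# its own ordered slice of the stream and publishing to an output stream.
def make_task(partition_id):
    def task(message):
        # Process the message and "publish" the result downstream.
        return f"partition-{partition_id}: {message.upper()}"
    return task

partitions = {0: ["a", "b"], 1: ["c"]}               # partitioned input stream
tasks = {pid: make_task(pid) for pid in partitions}  # one task per partition

output_stream = []
for pid, messages in partitions.items():
    for msg in messages:
        output_stream.append(tasks[pid](msg))

print(output_stream)  # ['partition-0: A', 'partition-0: B', 'partition-1: C']
```

In a real deployment each task would run in its own container and the partitions would be consumed concurrently; the loop above just makes the one-task-per-partition mapping visible.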
The Apache Storm Architecture is based on the concept of Spouts and Bolts. Spouts are sources of information and push information to one or more Bolts, which can then be chained to other Bolts and the whole topology becomes a DAG. The topology - how the Spouts and Bolts are connected together is explicitly defined by the developer.
Streams of data in Kafka are made up of multiple partitions (based on a key value). A Samza Task consumes a Stream of data and multiple tasks can be executed in parallel to consume all of the partitions in a stream simultaneously. Samza tasks execute in YARN containers.
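The key-based partitioning mentioned above can be sketched in a few lines (plain Python; the CRC32 hash is an illustrative choice, not necessarily what a given Kafka client uses):

```python
import zlib

def partition_for(key, num_partitions=4):
    """Map a message key to a partition. The same key always maps to the
    same partition, so a single task sees all messages for that key in order."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# All events for one user land in one partition, preserving their order.
assert partition_for("user-42") == partition_for("user-42")
```

This is why the number of partitions bounds the parallelism: a stream with N partitions can be consumed by at most N task instances at once.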
To do a word count example in Apache Storm, we need to create a simple Spout which generates sentences to be streamed to a Bolt which breaks up the sentences into words, and then another Bolt which counts words as they flow through. The output at each stage is shown in the diagram below.
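The same Spout → Bolt → Bolt flow can be sketched without a Storm cluster. The minimal Python simulation below mimics the pipeline with generators; the function names are illustrative and are not the Storm API:

```python
from collections import Counter

def sentence_spout():
    """Spout: emits a stream of sentences (here, a fixed list)."""
    for sentence in ["the cow jumped over the moon",
                     "the man went to the store"]:
        yield sentence

def split_bolt(sentences):
    """Bolt: splits each incoming sentence into words."""
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    """Bolt: keeps a running count of each word it sees."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["the"])  # "the" appears 4 times across both sentences
```

In real Storm, each stage would be a class deployed across worker nodes and the grouping between stages (e.g., routing the same word to the same counting Bolt) would be declared when building the topology.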
Well, I've been investigating these systems for a few months, and I don't think they differ profoundly in their use cases. I think it's best to compare them along these lines instead:
The biggest difference between Apache Storm and Apache Samza comes down to how they stream data to process it.
Apache Storm conducts real-time computation using a topology, which gets fed into a cluster where the master node distributes the code among worker nodes that execute it. In a topology, data is passed between Spouts, which emit data streams as immutable sets of key-value pairs.
Here's Apache Storm's architecture:
Apache Samza streams by processing messages one at a time as they come in. The streams are divided into partitions, each an ordered sequence in which every message has a unique ID. It supports batching and is typically used with Hadoop's YARN and Apache Kafka.
Here's Apache Samza's architecture:
Read more below about how each system handles the specifics.
USE CASE
Apache Samza was created by LinkedIn.
A software engineer wrote a post citing: