
Where do Apache Samza and Apache Storm differ in their use cases?

I've stumbled upon this article that purports to contrast Samza with Storm, but it seems to address only implementation details.

Where do these two distributed computation engines differ in their use cases? What kind of job is each tool good for?

asked Mar 17 '15 23:03 by Louis Thibault



2 Answers

Well, I've been investigating these systems for a few months, and I don't think they differ profoundly in their use cases. I think it's best to compare them along these lines instead:

  1. Age: Storm is the older project, and the original one in this space, so it's generally more mature and battle-tested. Samza is a newer, second-generation project that seems informed by lessons that were learned from Storm.
  2. Kafka: Samza grew out of the Kafka ecosystem, and is very Kafka-centric. For example, the documentation says that you can plug in different messaging systems... as long as they provide partitioning, ordering and replay semantics similar to Kafka's. Storm, being an older system, isn't so specialized to work with Kafka.
  3. Complexity: Samza, partly because it makes stronger assumptions about its environment ("you can have any infrastructure you like as long as it works like Kafka") and partly because it's just newer, strikes me as generally simpler than Storm, in a good way. But one perhaps less good way that Samza is simpler is that it (deliberately?) lacks Storm's concept of topologies (complex execution graphs). If you need a complex, multi-stage processor, it needs to be implemented as independent tasks that communicate through Kafka. This has advantages as well as disadvantages, but Samza makes the choice for you whereas Storm gives you more options.
  4. State management: Many Storm applications need to use an external store like Redis when they need to maintain a large volume of state to process incoming tuples. This situation seems to be one of the main things that motivated Samza's design; one of Samza's most distinctive features is that it provides its tasks with their own local disk-based key/value store to use for this purpose if they need it.
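
Points 3 and 4 can be illustrated with a minimal sketch in plain Python. It is not Samza's API: `Broker`, `ParseTask`, `CountTask`, `drain` and the topic names are all hypothetical, the broker is an in-memory stand-in for Kafka, and a dict stands in for the task's local disk-based key/value store. The point is only the shape: two independent tasks that know nothing about each other and communicate through an intermediate topic, with the second one keeping its state locally instead of in an external store such as Redis.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple

Message = Tuple[str, str]  # (key, value)


class Broker:
    """Hypothetical in-memory stand-in for Kafka: named topics of ordered messages."""

    def __init__(self) -> None:
        self.topics: Dict[str, Deque[Message]] = defaultdict(deque)

    def send(self, topic: str, key: str, value: str) -> None:
        self.topics[topic].append((key, value))

    def poll(self, topic: str) -> Optional[Message]:
        queue = self.topics[topic]
        return queue.popleft() if queue else None


class ParseTask:
    """Stage 1: turns raw "user,page" lines into messages keyed by user."""

    def process(self, msg: Message, broker: Broker) -> None:
        _, line = msg
        user, page = line.split(",", 1)
        broker.send("views-by-user", user.strip(), page.strip())


class CountTask:
    """Stage 2: counts views per user in a store that is local to the task."""

    def __init__(self) -> None:
        # Stands in for the task's local key/value store (an assumption of
        # this sketch; a real store would live on the task's local disk).
        self.store: Dict[str, int] = {}

    def process(self, msg: Message, broker: Broker) -> None:
        user, _ = msg
        self.store[user] = self.store.get(user, 0) + 1
        broker.send("view-counts", user, str(self.store[user]))


def drain(task, topic: str, broker: Broker) -> None:
    """Run one task until its input topic is empty, one message at a time."""
    msg = broker.poll(topic)
    while msg is not None:
        task.process(msg, broker)
        msg = broker.poll(topic)


def demo() -> Dict[str, int]:
    broker = Broker()
    for line in ["alice,/home", "bob,/cart", "alice,/cart"]:
        broker.send("raw-views", "", line)
    counter = CountTask()
    # The stages are wired together only by topic names, not by a topology.
    drain(ParseTask(), "raw-views", broker)
    drain(counter, "views-by-user", broker)
    return counter.store
```

In this toy version the "topology" exists only implicitly, as the chain of topic names, which is roughly the trade-off described in point 3.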
answered Sep 29 '22 19:09 by Luis Casillas


The biggest difference between Apache Storm and Apache Samza is in how each one streams and processes data.

Apache Storm performs real-time computation using a topology, which is fed into a cluster where the master node distributes the code among worker nodes that execute it. In a topology, data is passed between spouts, which emit data streams as immutable sets of key-value pairs (tuples), and bolts, which transform those streams.
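
As a rough illustration of this model, here is a toy word-count pipeline in plain Python. It does not use Storm's API and runs in a single process with no cluster. The function names (`make_tuple`, `sentence_spout`, `split_bolt`, `count_bolt`, `run_topology`) are made up for this sketch, and "bolt" is used in Storm's sense of a processing step that consumes a stream. Each tuple is modelled as a read-only mapping of key-value pairs.

```python
from collections import Counter
from types import MappingProxyType
from typing import Iterable, Iterator, Mapping

StormTuple = Mapping[str, object]


def make_tuple(**fields: object) -> StormTuple:
    """Build a read-only set of key-value pairs, standing in for a Storm tuple."""
    return MappingProxyType(dict(fields))


def sentence_spout(sentences: Iterable[str]) -> Iterator[StormTuple]:
    """Spout: the source of the stream; emits one tuple per sentence."""
    for sentence in sentences:
        yield make_tuple(sentence=sentence)


def split_bolt(stream: Iterable[StormTuple]) -> Iterator[StormTuple]:
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for tup in stream:
        for word in str(tup["sentence"]).lower().split():
            yield make_tuple(word=word)


def count_bolt(stream: Iterable[StormTuple]) -> Counter:
    """Bolt: final stage; keeps a running count for every word it sees."""
    counts: Counter = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
    return counts


def run_topology(sentences: Iterable[str]) -> Counter:
    # The graph is wired explicitly by the developer: spout -> split -> count.
    return count_bolt(split_bolt(sentence_spout(sentences)))


if __name__ == "__main__":
    print(run_topology(["the quick fox", "the lazy dog"]))
```

In a real cluster each stage would run as many parallel instances on different worker nodes; the sketch only shows how data flows through the graph.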

Here's Apache Storm's architecture: [image: Apache Storm architecture diagram]

Apache Samza processes messages one at a time as they arrive. Each stream is divided into partitions; a partition is an ordered sequence of messages, each of which has a unique ID. Samza supports batching and is typically used with Hadoop's YARN and Apache Kafka.
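
The partitioning idea can be sketched in a few lines of plain Python. This is not Samza's or Kafka's API: `PartitionedStream`, `partition_for` and `run_task` are hypothetical names, the partition count is arbitrary, and the CRC32-based key hashing is an assumption chosen only because it is stable across runs. Here the unique ID of a message is modelled as its position (offset) within its partition.

```python
import zlib
from typing import Callable, List, Tuple

NUM_PARTITIONS = 3  # arbitrary value for illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash, so equal keys stay together."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionedStream:
    """Toy stream: a fixed number of append-only, ordered partitions."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS) -> None:
        self.partitions: List[List[Tuple[str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def send(self, key: str, value: str) -> Tuple[int, int]:
        """Append a message and return its (partition, offset)."""
        partition = partition_for(key, len(self.partitions))
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


def run_task(
    stream: PartitionedStream,
    partition: int,
    process: Callable[[int, int, str, str], None],
) -> None:
    """One task per partition: handle its messages in offset order, one at a time."""
    for offset, (key, value) in enumerate(stream.partitions[partition]):
        process(partition, offset, key, value)


if __name__ == "__main__":
    stream = PartitionedStream()
    for user, page in [("alice", "/home"), ("bob", "/cart"), ("alice", "/cart")]:
        stream.send(user, page)
    for p in range(NUM_PARTITIONS):
        run_task(stream, p, lambda part, off, k, v: print(part, off, k, v))
```

Because all messages with the same key land in the same partition, the task that owns that partition sees them in order, which is what makes per-key processing possible without coordination between tasks.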

Here's Apache Samza's architecture: [image: Apache Samza architecture diagram]

The resources listed below describe in more detail how each system executes its jobs.

USE CASE

Apache Samza was created by LinkedIn.

A software engineer wrote a post stating:

It's been in production at LinkedIn for several years and currently runs on hundreds of machines across multiple data centers. Our largest Samza job is processing over 1,000,000 messages per-second during peak traffic hours.

Resources Used:

Storm vs. Samza Comparison

Useful Architectural References of Storm and Samza

answered Sep 29 '22 20:09 by mprithibi