In Samza and Kafka Streams, data stream processing is performed in a sequence/graph of processing steps (called a "dataflow graph" in Samza and a "topology" in Kafka Streams), where each step is called a "job" in Samza and a "processor" in Kafka Streams. I will refer to these two concepts as workflow and worker in the remainder of this question.
Let's assume that we have a very simple workflow, consisting of a worker A that consumes sensor measurements and filters out all values below 50, followed by a worker B that receives the remaining measurements and filters out all values above 80.
Input (Kafka topic X) --> (Worker A) --> (Worker B) --> Output (Kafka topic Y)
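For illustration, I imagine the Kafka Streams version of this workflow would look roughly like the following sketch (the topic names and the Integer value type are just placeholders of mine):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

// Hypothetical sketch of the example workflow; topic names and types are assumptions.
StreamsBuilder builder = new StreamsBuilder();
KStream<String, Integer> measurements = builder.stream("topic-X");

measurements
    .filter((sensorId, value) -> value >= 50)   // Worker A: drop values below 50
    .filter((sensorId, value) -> value <= 80)   // Worker B: drop values above 80
    .to("topic-Y");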
If I have understood correctly, both Samza and Kafka Streams use the topic partitioning concept for replicating the workflow/workers and thus parallelizing the processing for scalability purposes.
But:
Samza replicates each worker (i.e., job) separately into multiple tasks (one per partition of the input stream). That is, a task is a replica of a single worker of the workflow.
Kafka Streams replicates the whole workflow (i.e., topology) at once into multiple tasks (one per partition of the input stream). That is, a task is a replica of the whole workflow.
This brings me to my questions:
Assume that there is only one partition: Is it correct that it is not possible to deploy Worker A and Worker B on two different machines in Kafka Streams, while this is possible in Samza? (Or in other words: is it impossible in Kafka Streams to split a single task (i.e., topology replica) across two machines, regardless of whether there are multiple partitions or not?)
How do two subsequent processors in a Kafka Streams topology (within the same task) communicate? (I know that in Samza all communication between two subsequent workers (i.e., jobs) is done via Kafka topics, but since in Kafka Streams one has to explicitly "mark" in the code which streams are to be published as Kafka topics, this can't be the case here.)
Is it correct that Samza also publishes all intermediate streams automatically as Kafka topics (and thus makes them available to potential clients), while Kafka Streams only publishes those intermediate and final streams that one marks explicitly (with addSink in the low-level API, and to or through in the DSL)?
(I'm aware of the fact that Samza can also use message queues other than Kafka, but this is not really relevant for my questions.)
Apache Kafka is a popular open-source distributed and fault-tolerant event streaming platform. The Kafka Consumer client provides the basic functionality for reading messages, and Kafka Streams provides real-time stream processing on top of that client.
Apache Kafka is a set of tools designed for event streaming. Kafka, Kafka Streams and Kafka Connect are all components in the Kafka project.
Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in an Apache Kafka® cluster. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka's server-side cluster technology.
Kafka supports two types of topics: Regular and compacted. Regular topics can be configured with a retention time or a space bound. If there are records that are older than the specified retention time or if the space bound is exceeded for a partition, Kafka is allowed to delete old data to free storage space.
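These retention and compaction behaviours are ordinary per-topic configurations. As a hedged illustration, creating one topic of each kind with the Java AdminClient might look roughly like this (broker address, topic names, partition counts and size/time limits are purely illustrative):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // Regular topic: old records are deleted after the retention time
            // or once the per-partition size bound is exceeded.
            NewTopic regular = new NewTopic("sensor-measurements", 4, (short) 1)
                    .configs(Map.of(
                            "cleanup.policy", "delete",
                            "retention.ms", "604800000",       // 7 days
                            "retention.bytes", "1073741824")); // ~1 GB per partition

            // Compacted topic: only the latest record per key is retained.
            NewTopic compacted = new NewTopic("sensor-latest-values", 4, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));

            admin.createTopics(List.of(regular, compacted)).all().get();
        }
    }
}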
Streams of data in Kafka are made up of multiple partitions (based on a key value). A Samza Task consumes a Stream of data and multiple tasks can be executed in parallel to consume all of the partitions in a stream simultaneously. Samza tasks execute in YARN containers.
Apache Samza is based on the concept of a Publish/Subscribe Task that listens to a data stream, processes messages as they arrive and outputs its result to another stream. A stream can be broken into multiple partitions and a copy of the task will be spawned for each partition.
First of all, in both Samza and Kafka Streams, you can choose to have an intermediate topic between these two tasks (processors) or not, i.e. the topology can be either:
Input (Kafka topic X) --> (Worker A) --> (Worker B) --> Output (Kafka topic Y)
or:
Input (Kafka topic X) --> (Worker A) --> Intermediate (Kafka topic Z) --> (Worker B) --> Output (Kafka topic Y)
In either Samza or Kafka Streams, in the former case you have to deploy Worker A and B together, while in the latter case they become separate units that can be deployed independently, since in either framework tasks only communicate through intermediate topics and there are no TCP-based communication channels between them.
In Samza, for the former case you need to code your two filters within one task, and for the latter case you need to specify the input and output topic for each of the tasks, e.g. for Worker A the input is X and the output is Z, and for Worker B the input is Z and the output is Y; you can then start and stop the deployed workers independently.
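For instance, a rough sketch of Worker A as its own Samza job in the low-level API might look like this (the class name, topic names and the integer message type are my own assumptions; the actual input/output systems and streams are wired up in the job configuration):

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Hypothetical Worker A: consumes topic X (configured as the task input),
// keeps values >= 50 and forwards them to the intermediate topic Z.
public class WorkerATask implements StreamTask {

    private static final SystemStream OUTPUT = new SystemStream("kafka", "topic-Z");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        Integer value = (Integer) envelope.getMessage(); // assumes an integer serde
        if (value != null && value >= 50) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getKey(), value));
        }
    }
}

Worker B would be an analogous task with topic Z as input and topic Y as output, and the two jobs can be started and stopped independently.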
In Kafka Streams, for the former case you can just "concatenate" these processors like
stream1.filter(..).filter(..)
and as a result, as Lucas mentioned, each result from the first filter will be immediately passed to the second filter (you can think of each input record from topic X traversing the topology in depth-first order, with no buffering between any directly connected processors);
And for the latter case you can indicate that the intermediate stream should be "materialized" to another topic, i.e.:
stream1.filter(..).through("topicZ").filter(..)
and each result of the first filter will be sent to the topic Z, which will then be pipelined to the second filter processor. In this case these two filters can potentially be deployed on different hosts or different threads within the same host.
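Putting the second variant together, a minimal hedged sketch of a runnable Kafka Streams application might look like this (application id, broker address, topic names and serdes are assumptions; note that newer Kafka Streams releases deprecate through() in favour of repartition(), and that writing with to() and re-reading with stream(), as below, is equivalent for this purpose):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

// Hedged sketch: Worker A and Worker B are decoupled by the intermediate topic Z,
// so they end up in separate sub-topologies whose tasks can run on different hosts.
public class SensorFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sensor-filter-app");  // assumed id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Integer().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Worker A: drop values below 50 and materialize the result in topic Z.
        builder.<String, Integer>stream("topic-X")
               .filter((sensorId, value) -> value >= 50)
               .to("topic-Z");

        // Worker B: read topic Z, drop values above 80, write the final result to topic Y.
        builder.<String, Integer>stream("topic-Z")
               .filter((sensorId, value) -> value <= 80)
               .to("topic-Y");

        new KafkaStreams(builder.build(), props).start();
    }
}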