Apache Kafka: Distributed messaging system
Apache Storm: Real Time Message Processing
How we can use both technologies in a real-time data pipeline for processing event data?
In terms of real time data pipeline both seems to me do the job identical. How can we use both the technologies on a data pipeline?
You use Apache Kafka as a distributed and robust queue that can handle high volume data and enables you to pass messages from one end-point to another.
Storm is not a queue. It is a system that has distributed real time processing abilities, meaning you can execute all kind of manipulations on real time data in parallel.
The common flow of these tools (as I know it) goes as follows:
real-time-system --> Kafka --> Storm --> NoSql --> BI(optional)
So you have your real time app handling high volume data, sends it to Kafka queue. Storm pulls the data from kafka and applies some required manipulation. At this point you usually like to get some benefits from this data, so you either send it to some Nosql db for additional BI calculations, or you could simply query this NoSql from any other system.
Kafka and Storm have a slightly different purpose:
Kafka is a distributed message broker which can handle big amount of messages per second. It uses publish-subscribe paradigm and relies on topics and partitions. Kafka uses Zookeeper to share and save state between brokers. So Kafka is basically responsible for transferring messages from one machine to another.
Storm is a scalable, fault-tolerant, real-time analytic system (think like Hadoop in realtime). It consumes data from sources (Spouts) and passes it to pipeline (Bolts). You can combine them in the topology. So Storm is basically a computation unit (aggregation, machine learning).
But you can use them together: for example your application uses kafka to send data to other servers which uses storm to make some computation on it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With