I am reading about Spark and its real-time stream processing. I am confused: if Spark can itself read a stream from a source such as Twitter or a file, why do we need Kafka to feed data to Spark? It would be great if someone could explain what advantage we get by using Spark with Kafka. Thank you.
The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
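For example, a direct stream can be created with the spark-streaming-kafka-0-10 connector roughly as follows. This is a minimal Scala sketch; the broker address, group id, and the topic name "tweets" are placeholders for your own setup:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("KafkaDirectStreamExample")
val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

// Placeholder connection settings -- adjust for your cluster
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-consumer-group",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// One Spark partition per Kafka partition (the 1:1 correspondence mentioned above)
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Array("tweets"), kafkaParams)
)

// Offsets and metadata are available per batch
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.map(_.value).take(10).foreach(println)
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()
```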
Why would you use Kafka? Kafka is used to build real-time streaming data pipelines and real-time streaming applications. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data.
Kafka -> External Systems ('Kafka -> Database' or 'Kafka -> Data science model'): Typically, any stream-processing framework (Spark, Flink, NiFi, etc.) uses Kafka as the message broker in such pipelines.
This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API. As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and jobs launched by Spark Streaming then process the data.
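For comparison, a receiver-based stream with the older spark-streaming-kafka-0-8 connector looks roughly like this (a sketch that reuses the StreamingContext `ssc` from above; the ZooKeeper address, group id, and topic map are placeholders):

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based approach (Kafka 0.8 high-level consumer API).
// "tweets" -> 1 means one consumer thread for that topic.
val receiverStream = KafkaUtils.createStream(
  ssc,                     // StreamingContext from the previous sketch
  "localhost:2181",        // ZooKeeper quorum (placeholder)
  "spark-receiver-group",  // consumer group id (placeholder)
  Map("tweets" -> 1)
)

receiverStream.map(_._2).print()   // records arrive as (key, message) pairs
```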
Kafka offers decoupling and buffering of your input stream.
Take Twitter data as an example: as far as I know, you connect to the Twitter API and get a constant stream of tweets that match the criteria you specified. If you now shut down your Spark jobs for an hour to do some maintenance on your servers or to roll out a new version, you will miss the tweets from that hour.
Now imagine you put Kafka in front of your Spark jobs and have a very simple ingest process that does nothing but connect to the API and write tweets to Kafka, from where the Spark jobs retrieve them. Since Kafka persists everything to disk, you can shut down your processing jobs, perform maintenance, and when they are restarted they will pick up all the data from the time they were offline.
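Such an ingest process can be little more than a Kafka producer loop. A minimal Scala sketch, where `fetchTweets()` is a hypothetical stand-in for whatever Twitter client you use, and the broker address and topic name are likewise placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// fetchTweets() is a stand-in for your Twitter client; assumed to return
// an iterator of tweet JSON strings.
def fetchTweets(): Iterator[String] = ???

// Write every tweet to the "tweets" topic; Kafka persists them to disk,
// so downstream Spark jobs can be stopped and restarted without data loss.
for (tweet <- fetchTweets()) {
  producer.send(new ProducerRecord[String, String]("tweets", tweet))
}

producer.close()
```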
Also, if you change your processing jobs in a significant way and want to reprocess the data from the last week, you can easily do that if you have Kafka in your chain (provided your retention time is set high enough). You would simply roll out your new jobs and change the offsets in Kafka so that your jobs re-read the old data; once that is done, your data store is up to date with your new processing model.
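One way to do that with the direct stream shown earlier is to hand explicit starting offsets to the consumer strategy. A sketch that reuses `ssc` and `kafkaParams` from above; the topic, partition count, and offset values are assumptions about your setup:

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Rewind partitions 0-2 of the "tweets" topic so the new job re-reads older
// data. Offset 0L simply means "the start of what Kafka has retained"; in
// practice you would look up the offsets for the point in time you want.
val startingOffsets: Map[TopicPartition, Long] =
  (0 to 2).map(p => new TopicPartition("tweets", p) -> 0L).toMap

val reprocessingStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("tweets"), kafkaParams, startingOffsets)
)
```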
There is a good article on the general principle written by Jay Kreps, one of the people behind Kafka; give that a read if you want to know more.