Why do we need Kafka to feed data to Apache Spark?

I am reading about Spark and its real-time stream processing. I am confused: if Spark can itself read a stream from a source such as Twitter or a file, why do we need Kafka to feed data to Spark? It would be great if someone could explain what advantage we get if we use Spark with Kafka. Thank you.

asked Mar 08 '17 by Waqar Ahmed

People also ask

Why is Spark used with Kafka?

The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
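A minimal sketch of that 0.10 direct-stream wiring, assuming a local broker, a topic called tweets, and a throwaway group id (all placeholders, not taken from the answer above):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object DirectStreamSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct-sketch"), Seconds(5))

        // Placeholder connection settings -- adjust to your cluster.
        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "direct-sketch-group",
          "auto.offset.reset"  -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        // One Spark partition per Kafka partition; offsets and metadata are exposed per batch.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Array("tweets"), kafkaParams))

        stream.foreachRDD { rdd =>
          val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          ranges.foreach(o => println(s"${o.topic}-${o.partition}: ${o.fromOffset} -> ${o.untilOffset}"))
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }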

Why is Kafka required?

Why would you use Kafka? Kafka is used to build real-time streaming data pipelines and real-time streaming applications. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data.

Does Spark use Kafka?

Kafka -> External Systems ('Kafka -> Database' or 'Kafka -> Data science model'): Typically, any streaming library (Spark, Flink, NiFi, etc.) uses Kafka as a message broker.

How does Spark read data from Kafka?

This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API. As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and the jobs launched by Spark Streaming then process the data.
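For comparison, a minimal sketch of that receiver-based approach using the old spark-streaming-kafka-0-8 createStream call (deprecated and removed from recent Spark releases); the ZooKeeper address, group id, and topic map below are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils  // spark-streaming-kafka-0-8 (deprecated)

    object ReceiverSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-receiver-sketch"), Seconds(5))

        // The receiver talks to Kafka through the high-level consumer API via ZooKeeper.
        val lines = KafkaUtils.createStream(
          ssc, "localhost:2181", "receiver-sketch-group", Map("tweets" -> 1))

        // Received records are stored on the executors; the jobs that Spark Streaming
        // launches then process them (here, just a count per batch).
        lines.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }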


1 Answer

Kafka offers decoupling and buffering of your input stream.

Take Twitter data for example: as far as I know, you connect to the Twitter API and get a constant stream of tweets that match criteria you specified. If you now shut down your Spark jobs for an hour to do some maintenance on your servers or to roll out a new version, then you will miss the tweets from that hour.

Now imagine you put Kafka in front of your Spark jobs and have a very simple ingest thread that does nothing but connect to the API and write tweets to Kafka, where the Spark jobs retrieve them from. Since Kafka persists everything to disk, you can shut down your processing jobs, perform maintenance, and when they are restarted they will retrieve all data from the time they were offline.
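As an illustration of that split (my sketch, not the answerer's code), the ingest side can be nothing more than a plain Kafka producer loop; fetchNextTweet() below is a hypothetical stand-in for whatever Twitter client you use, and the broker address and topic name are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object TweetIngest {
      // Hypothetical stand-in for the real Twitter client call.
      def fetchNextTweet(): String = """{"text": "placeholder tweet"}"""

      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        while (true) {
          // Kafka persists each record to disk, so the Spark jobs downstream
          // can be stopped and restarted without losing anything written here.
          producer.send(new ProducerRecord[String, String]("tweets", fetchNextTweet()))
        }
      }
    }

The Spark jobs then subscribe to the tweets topic on their own schedule, so maintenance on the Spark side never interrupts the ingest.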

Also, if you change your processing jobs in a significant way and want to reprocess data from the last week, you can easily do that if you have Kafka in your chain (provided you set your retention time high enough). You'd simply roll out your new jobs and change the offsets in Kafka so that your jobs reread the old data; once that is done, your data store is up to date with your new processing model.
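A sketch of that reprocessing step with Spark's Structured Streaming Kafka source (again mine, not from the answer): setting startingOffsets to earliest together with a fresh checkpoint location makes the new job reread whatever is still inside Kafka's retention window. Broker, topic, and paths are placeholders:

    import org.apache.spark.sql.SparkSession

    object ReprocessSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-reprocess-sketch").getOrCreate()

        // Reread everything Kafka still retains for the topic.
        val tweets = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
          .option("subscribe", "tweets")                        // placeholder topic
          .option("startingOffsets", "earliest")                // go back to retained history
          .load()
          .selectExpr("CAST(value AS STRING) AS tweet")

        // Use a *new* checkpoint location so the query does not resume from
        // the offsets the old job had already committed.
        tweets.writeStream
          .format("parquet")
          .option("path", "/data/tweets_v2")                      // placeholder output path
          .option("checkpointLocation", "/checkpoints/tweets_v2") // placeholder checkpoint
          .start()
          .awaitTermination()
      }
    }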

There is a good article on the general principle written by Jay Kreps, one of the people behind Kafka; give that a read if you want to know more.

answered Oct 17 '22 by Sönke Liebau