Why do we need Kafka to feed data to Apache Spark?

I am reading about Spark and its real-time stream processing. I am confused: if Spark can itself read a stream from a source such as Twitter or a file, why do we need Kafka to feed data to Spark? It would be great if someone could explain what advantage we get if we use Spark with Kafka. Thank you.

asked Mar 08 '17 by Waqar Ahmed

People also ask

Why is Spark used with Kafka?

The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata.
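A minimal sketch of that 0.10 direct-stream wiring, assuming a local broker, a topic called tweets, and a throwaway group id (all placeholders, not taken from the answer above):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object DirectStreamSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct-sketch"), Seconds(5))

        // Placeholder connection settings -- adjust to your cluster.
        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "direct-sketch-group",
          "auto.offset.reset"  -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        // One Spark partition per Kafka partition; offsets and metadata are exposed per batch.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Array("tweets"), kafkaParams))

        stream.foreachRDD { rdd =>
          val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
          ranges.foreach(o => println(s"${o.topic}-${o.partition}: ${o.fromOffset} -> ${o.untilOffset}"))
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }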

Why is Kafka required?

Why would you use Kafka? Kafka is used to build real-time streaming data pipelines and real-time streaming applications. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data.

Does Spark use Kafka?

Kafka -> External Systems ('Kafka -> Database' or 'Kafka -> Data science model'): Typically, any streaming library (Spark, Flink, NiFi, etc.) uses Kafka as a message broker.

How does Spark read data from Kafka?

This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API. As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and the jobs launched by Spark Streaming then process the data.
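For comparison, a minimal sketch of that receiver-based approach using the old spark-streaming-kafka-0-8 createStream call (deprecated and removed from recent Spark releases); the ZooKeeper address, group id, and topic map below are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils  // spark-streaming-kafka-0-8 (deprecated)

    object ReceiverSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("kafka-receiver-sketch"), Seconds(5))

        // The receiver talks to Kafka through the high-level consumer API via ZooKeeper.
        val lines = KafkaUtils.createStream(
          ssc, "localhost:2181", "receiver-sketch-group", Map("tweets" -> 1))

        // Received records are stored on the executors; the jobs that Spark Streaming
        // launches then process them (here, just a count per batch).
        lines.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }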


1 Answer

Kafka offers decoupling and buffering of your input stream.

Take Twitter data for example: as far as I know, you connect to the Twitter API and get a constant stream of tweets that match criteria you specified. If you now shut down your Spark jobs for an hour to do some maintenance on your servers or to roll out a new version, then you will miss the tweets from that hour.

Now imagine you put Kafka in front of your Spark jobs and have a very simple ingest thread that does nothing but connect to the API and write tweets to Kafka, where the Spark jobs retrieve them from. Since Kafka persists everything to disk, you can shut down your processing jobs, perform maintenance, and when they are restarted they will retrieve all data from the time they were offline.
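As an illustration of that split (my sketch, not the answerer's code), the ingest side can be nothing more than a plain Kafka producer loop; fetchNextTweet() below is a hypothetical stand-in for whatever Twitter client you use, and the broker address and topic name are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object TweetIngest {
      // Hypothetical stand-in for the real Twitter client call.
      def fetchNextTweet(): String = """{"text": "placeholder tweet"}"""

      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        while (true) {
          // Kafka persists each record to disk, so the Spark jobs downstream
          // can be stopped and restarted without losing anything written here.
          producer.send(new ProducerRecord[String, String]("tweets", fetchNextTweet()))
        }
      }
    }

The Spark jobs then subscribe to the tweets topic on their own schedule, so maintenance on the Spark side never interrupts the ingest.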

Also, if you change your processing jobs in a significant way and want to reprocess data from the last week, you can easily do that if you have Kafka in your chain (provided you set your retention time high enough). You'd simply roll out your new jobs and change the offsets in Kafka so that your jobs reread the old data; once that is done, your data store is up to date with your new processing model.
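A sketch of that reprocessing step with Spark's Structured Streaming Kafka source (again mine, not from the answer): setting startingOffsets to earliest together with a fresh checkpoint location makes the new job reread whatever is still inside Kafka's retention window. Broker, topic, and paths are placeholders:

    import org.apache.spark.sql.SparkSession

    object ReprocessSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-reprocess-sketch").getOrCreate()

        // Reread everything Kafka still retains for the topic.
        val tweets = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
          .option("subscribe", "tweets")                        // placeholder topic
          .option("startingOffsets", "earliest")                // go back to retained history
          .load()
          .selectExpr("CAST(value AS STRING) AS tweet")

        // Use a *new* checkpoint location so the query does not resume from
        // the offsets the old job had already committed.
        tweets.writeStream
          .format("parquet")
          .option("path", "/data/tweets_v2")                      // placeholder output path
          .option("checkpointLocation", "/checkpoints/tweets_v2") // placeholder checkpoint
          .start()
          .awaitTermination()
      }
    }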

There is a good article on the general principle written by Jay Kreps, one of the people behind Kafka; give that a read if you want to know more.

answered Oct 17 '22 by Sönke Liebau