
KStreams + Spark Streaming + Machine Learning

I'm doing a POC for running a machine learning algorithm on a stream of data.
My initial idea was to take the data and use

Spark Streaming --> aggregate data from several tables --> run MLlib on the stream of data --> produce output.

But then I came across KStreams, and now I'm confused.

Questions:
1. What is the difference between Spark Streaming and Kafka Streaming?
2. How can I marry KStreams + Spark Streaming + Machine Learning?
3. My idea is to train on the data continuously rather than in batches.

underwood asked Dec 13 '16 21:12


3 Answers

I recently gave a conference talk on this topic.

Apache Kafka Streams or Spark Streaming are typically used to apply a machine learning model in real time to new events via stream processing (processing data while it is in motion). Matthias's answer already discusses their differences.
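
To make "apply a model to new events" concrete, here is a minimal sketch in plain Python. It does not use the Kafka Streams or Spark APIs. The weights, feature names, and the `score_stream` helper are hypothetical and only stand in for a model that was trained offline and is then evaluated once per incoming event.

```python
import math
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical pre-trained logistic-regression parameters. In practice these
# would be exported from a batch training job on historical data.
WEIGHTS: Dict[str, float] = {"amount": 0.8, "num_items": -0.3}
BIAS = -1.0


def score(event: Dict[str, float]) -> float:
    """Return the model's probability for a single event."""
    z = BIAS + sum(w * event.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def score_stream(
    events: Iterable[Dict[str, float]],
) -> Iterator[Tuple[Dict[str, float], float]]:
    """Score events one at a time as they arrive, without buffering them."""
    for event in events:
        yield event, score(event)


if __name__ == "__main__":
    incoming = [
        {"amount": 2.0, "num_items": 1.0},
        {"amount": 0.5, "num_items": 3.0},
    ]
    for event, probability in score_stream(incoming):
        print(event, round(probability, 3))
```

In a real deployment the loop body would sit inside whatever per-record hook the stream processor offers; the model itself is read-only at this stage.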

On the other hand, you first use something like Apache Spark MLlib (or H2O.ai or XYZ) to build the analytic models from historical data sets.

Kafka Streams can be used for online training of models, too. However, I think online training has various caveats.
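
As a rough illustration of what online training means, the sketch below updates a linear model with one stochastic-gradient step per record instead of fitting on a full batch. It is plain Python, not Kafka Streams code. The class name, learning rate, and feature names are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple


class OnlineLinearRegressor:
    """Linear model updated with one stochastic-gradient step per record."""

    def __init__(self, learning_rate: float = 0.05) -> None:
        self.learning_rate = learning_rate
        self.weights: Dict[str, float] = {}
        self.bias = 0.0

    def predict(self, features: Dict[str, float]) -> float:
        return self.bias + sum(
            self.weights.get(name, 0.0) * value for name, value in features.items()
        )

    def update(self, features: Dict[str, float], target: float) -> float:
        """Apply one SGD step on squared error; return the error before the step."""
        error = self.predict(features) - target
        for name, value in features.items():
            old = self.weights.get(name, 0.0)
            self.weights[name] = old - self.learning_rate * error * value
        self.bias -= self.learning_rate * error
        return error


def train_on_stream(
    model: OnlineLinearRegressor,
    records: Iterable[Tuple[Dict[str, float], float]],
) -> OnlineLinearRegressor:
    """Feed labeled records to the model one by one, in arrival order."""
    for features, target in records:
        model.update(features, target)
    return model
```

The model changes after every record, so issues such as noisy labels, drift, and the lack of a fixed validation set show up immediately, which is one way to read the caveats mentioned above.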

All of this is discussed in more detail in my slide deck "Apache Kafka Streams and Machine Learning / Deep Learning for Real Time Stream Processing".

Kai Wähner answered Nov 03 '22 01:11


First of all, the term "Confluent's Kafka Streaming" is technically not correct.

  1. it's called Kafka's Streams API (aka Kafka Streams)
  2. it's part of Apache Kafka and thus "owned" by the Apache Software Foundation (and not by Confluent)
  3. there is Confluent Open Source and Confluent Enterprise -- two offers from Confluent that both leverage Apache Kafka (and thus, Kafka Streams)

However, Confluent contributes a lot of code to Apache Kafka, including Kafka Streams.

About the differences (I only highlight some main differences and refer to the Internet and documentation for further details: http://docs.confluent.io/current/streams/index.html and http://spark.apache.org/streaming/)

Spark Streaming:

  • micro-batching (no real record-by-record stream processing)
  • no sub-second latency
  • limited window operations
  • no event-time processing
  • processing framework (difficult to operate and to deploy)
  • part of Apache Spark -- a data processing framework
  • exactly-once processing
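
To show why micro-batching puts a floor on latency, here is a small plain-Python sketch (not Spark code). Events are grouped by arrival interval and only emitted when the interval closes, whereas record-by-record processing emits each event on arrival. The function names and the one-second interval in the usage note are illustrative assumptions.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(
    events: Iterable[Event], interval: float
) -> Iterator[Tuple[float, List[str]]]:
    """Yield (emit_time, payloads) for each non-empty arrival interval.

    Assumes events are sorted by arrival time.
    """
    batch: List[str] = []
    batch_end = 0.0
    for arrival, payload in events:
        # End of the interval this event falls into.
        end = (arrival // interval + 1) * interval
        if batch and end != batch_end:
            yield batch_end, batch
            batch = []
        batch_end = end
        batch.append(payload)
    if batch:
        yield batch_end, batch


def record_by_record(events: Iterable[Event]) -> Iterator[Tuple[float, List[str]]]:
    """Emit every event at its own arrival time, for comparison."""
    for arrival, payload in events:
        yield arrival, [payload]
```

With `interval=1.0`, an event arriving at 0.1 s is not emitted until 1.0 s, so its result is delayed by 0.9 s even if the computation itself is instant.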

Kafka Streams

  • record-by-record stream processing
  • ms latency
  • rich window operations
  • stream/table duality
  • event time, ingestion time, and processing time semantics
  • Java library (easy to run and deploy -- it's just a Java application as any other)
  • part of Apache Kafka -- a Stream Processing Platform (ie, it offers storage and processing at once)
  • at-least-once processing (exactly-once processing is WIP; cf KIP-98 and KIP-129)
  • elastic, ie, dynamically scalable
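
The stream/table duality mentioned in the list above can be sketched without any Kafka dependency: a table is what you get by replaying a changelog stream, and aggregating a stream produces a table whose updates form a changelog stream again. The helper names below are hypothetical and this is not the Kafka Streams API.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Change = Tuple[str, Optional[int]]  # (key, new value); None marks a delete


def materialize(changelog: Iterable[Change]) -> Dict[str, int]:
    """Replay a changelog stream into a table with the latest value per key."""
    table: Dict[str, int] = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


def running_counts(keys: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Count keys in a stream, emitting each table update as a changelog record."""
    counts: Dict[str, int] = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]


if __name__ == "__main__":
    changelog = list(running_counts(["a", "b", "a"]))
    print(changelog)               # stream view of the counts
    print(materialize(changelog))  # table view of the same data
```

Both views carry the same information; which one you work with depends on whether you need every update or only the current state per key.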

Thus, there is no reason to "marry" both -- it's a question of which one you want to use.

My personal take is that Spark is not a good solution for stream processing. Whether you want to use a library like Kafka Streams or a framework like Apache Flink, Apache Storm, or Apache Apex (which are all good options for stream processing) depends on your use case (and maybe personal taste) and cannot be answered on SO.

A main differentiator of Kafka Streams is that it is a library and does not require a processing cluster. Because it is part of Apache Kafka, if you already have Apache Kafka in place, this might simplify your overall deployment, as you do not need to run an extra processing cluster.

Matthias J. Sax answered Nov 03 '22 01:11


Spark Streaming and KStreams in one picture, from a stream processing point of view:

[Image: Spark and KStreams]

I have highlighted only the significant advantages of Spark Streaming and KStreams here to keep the answer short.

Spark Streaming Advantages over KStreams:

  1. Easy to integrate Spark ML models and graph computing in the same application without writing data outside of the application, which means processing is much quicker than writing to Kafka again and processing from there.
  2. Join non-streaming sources, such as file systems and other non-Kafka sources, with stream sources in the same application.
  3. Messages with a schema can be easily processed with SQL (Structured Streaming).
  4. Possible to do graph analysis over streaming data with the built-in GraphX library.
  5. Spark apps can be deployed on an existing YARN or Mesos cluster, if there is one.

KStreams Advantages:

  1. Compact library for ETL processing and ML model serving/training on messages, with rich features. So far, both source and target must be Kafka topics.
  2. Easy to achieve exactly-once semantics.
  3. No separate processing cluster required.
  4. Easy to deploy on Docker since it's a plain Java application.

mrsrinivas answered Nov 03 '22 00:11