I'm trying to implement a Lambda Architecture using the following tools: Apache Kafka to receive all the datapoints, Spark for batch processing (Big Data), Spark Streaming for real-time processing (Fast Data), and Cassandra to store the results.
Also, all the datapoints I receive belong to a user session, so for the batch processing I'm only interested in processing the datapoints once a session finishes. Since I'm using Kafka, the only way I can see to solve this (assuming that all the datapoints are stored in the same topic) is for the batch job to fetch all the messages in the topic and then ignore those that correspond to sessions that have not yet finished.
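The batch-side filtering described above can be sketched in plain Python. This is only an illustration of the logic that would run inside the Spark batch job over the full topic; the field names (`session_id`, `type`) and the `"session_end"` marker are assumptions, not part of the original setup:

```python
def finished_session_points(datapoints):
    """Keep only the datapoints that belong to sessions that have ended."""
    # First pass: collect every session that contains an explicit end marker.
    finished = {p["session_id"] for p in datapoints
                if p["type"] == "session_end"}
    # Second pass: keep datapoints of finished sessions, ignore the rest.
    return [p for p in datapoints if p["session_id"] in finished]


events = [
    {"session_id": "a", "type": "click"},
    {"session_id": "a", "type": "session_end"},
    {"session_id": "b", "type": "click"},   # session "b" is still open
]
finished_session_points(events)  # keeps both "a" datapoints, drops "b"
```

Note that this implies reading the whole topic on every batch run; how a session end is detected (explicit marker vs. inactivity timeout) is a design choice the question leaves open.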
So, what I'd like to ask is: is this a reasonable approach, and is there a better way to deal with the sessions that have not yet finished?
Thanks.
This is a good approach. Using Spark for both the speed and batch layers lets you write the logic once and use it in both contexts.
Concerning your session issue: since you're handling that in batch mode anyway, why not just ingest the data from Kafka into HDFS or Cassandra and then write queries for full sessions there? You could use Spark Streaming's "direct connection" to Kafka (`KafkaUtils.createDirectStream`) to do the ingestion.
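The ingest-then-query pattern suggested here can be sketched without Spark at all. In this toy version an in-memory dict stands in for the Cassandra/HDFS store, `ingest` stands in for the Spark Streaming job appending each Kafka event as it arrives, and `full_sessions` stands in for the batch query; the field names are assumptions:

```python
from collections import defaultdict

# Toy stand-in for the Cassandra/HDFS store: session_id -> list of events.
store = defaultdict(list)


def ingest(event):
    """Streaming side: append every incoming event as soon as it arrives."""
    store[event["session_id"]].append(event)


def full_sessions():
    """Batch side: query only the sessions that contain an end marker."""
    return {
        sid: events
        for sid, events in store.items()
        if any(e["type"] == "session_end" for e in events)
    }
```

The point of the pattern is that the batch layer queries a store it can filter efficiently (e.g. by session key), instead of re-reading and discarding unfinished sessions from the Kafka topic on every run.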