I'm trying to implement a Lambda Architecture using the following tools: Apache Kafka to receive all the datapoints, Spark for batch processing (Big Data), Spark Streaming for real-time processing (Fast Data), and Cassandra to store the results.
Also, all the datapoints I receive belong to a user session, so for the batch processing I'm only interested in processing the datapoints once a session finishes. Since I'm using Kafka, the only way I can see to solve this (assuming that all the datapoints are stored in the same topic) is for the batch job to fetch all the messages in the topic and then ignore those that correspond to sessions that have not yet finished.
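The batch-side filtering described above can be sketched in plain Python. This is only an illustration of the logic that would run inside the Spark batch job over the full topic; the field names (`session_id`, `type`) and the `"session_end"` marker are assumptions, not part of the original setup:

```python
def finished_session_points(datapoints):
    """Keep only the datapoints that belong to sessions that have ended."""
    # First pass: collect every session that contains an explicit end marker.
    finished = {p["session_id"] for p in datapoints
                if p["type"] == "session_end"}
    # Second pass: keep datapoints of finished sessions, ignore the rest.
    return [p for p in datapoints if p["session_id"] in finished]


events = [
    {"session_id": "a", "type": "click"},
    {"session_id": "a", "type": "session_end"},
    {"session_id": "b", "type": "click"},   # session "b" is still open
]
finished_session_points(events)  # keeps both "a" datapoints, drops "b"
```

Note that this implies reading the whole topic on every batch run; how a session end is detected (explicit marker vs. inactivity timeout) is a design choice the question leaves open.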
So, what I'd like to ask is: is this a reasonable approach, and is there a better way to deal with the sessions that have not yet finished?
Thanks.
This is a good approach. Using Spark for both the speed and batch layers lets you write the logic once and use it in both contexts.
Concerning your session issue: since you're handling that in batch mode anyway, why not just ingest the data from Kafka into HDFS or Cassandra and then write queries for full sessions there? You could use Spark Streaming's "direct connection" to Kafka (`KafkaUtils.createDirectStream`) to do the ingestion.
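The ingest-then-query pattern suggested here can be sketched without Spark at all. In this toy version an in-memory dict stands in for the Cassandra/HDFS store, `ingest` stands in for the Spark Streaming job appending each Kafka event as it arrives, and `full_sessions` stands in for the batch query; the field names are assumptions:

```python
from collections import defaultdict

# Toy stand-in for the Cassandra/HDFS store: session_id -> list of events.
store = defaultdict(list)


def ingest(event):
    """Streaming side: append every incoming event as soon as it arrives."""
    store[event["session_id"]].append(event)


def full_sessions():
    """Batch side: query only the sessions that contain an end marker."""
    return {
        sid: events
        for sid, events in store.items()
        if any(e["type"] == "session_end" for e in events)
    }
```

The point of the pattern is that the batch layer queries a store it can filter efficiently (e.g. by session key), instead of re-reading and discarding unfinished sessions from the Kafka topic on every run.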