
Lambda Architecture with Apache Spark

I'm trying to implement a Lambda Architecture using the following tools: Apache Kafka to receive all the datapoints, Spark for batch processing (Big Data), Spark Streaming for real-time processing (Fast Data), and Cassandra to store the results.

Also, all the datapoints I receive belong to a user session, so for the batch processing I'm only interested in processing the datapoints once the session finishes. Since I'm using Kafka, the only way I see to solve this (assuming all the datapoints are stored in the same topic) is for the batch job to fetch all the messages in the topic and then ignore those that belong to sessions that have not yet finished, roughly as sketched below.
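For reference, the batch filter I have in mind looks roughly like this (just a sketch: `SessionPoint`, `parse`, the "session ended" flag, and the HDFS paths are placeholders for my real datapoint format):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder shape for a datapoint; my real format differs.
case class SessionPoint(sessionId: String, payload: String, sessionEnded: Boolean)

object BatchFilter {
  // Placeholder parser for lines shaped like: sessionId,payload,ended
  def parse(line: String): SessionPoint = {
    val Array(id, payload, ended) = line.split(",", 3)
    SessionPoint(id, payload, ended.toBoolean)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("session-batch"))

    val points = sc.textFile("hdfs:///data/datapoints/*").map(parse)

    // Ids of sessions that have already finished.
    val finished = points.filter(_.sessionEnded)
                         .map(p => (p.sessionId, ()))
                         .distinct()

    // Keep only the datapoints belonging to finished sessions.
    val toProcess = points.map(p => (p.sessionId, p))
                          .join(finished)
                          .values.map(_._1)

    toProcess.saveAsTextFile("hdfs:///data/sessions-ready") // placeholder output
  }
}
```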

So, what I'd like to ask is:

  • Is this a good approach to implement the Lambda Architecture? Or should I use Hadoop and Storm instead? (I can't find information about people using Kafka and Apache Spark for batch processing / MapReduce.)
  • Is there a better approach to solving the user-session problem?

Thanks.

luis.alves asked Jul 09 '15

1 Answer

This is a good approach. Using Spark for both the speed and batch layers lets you write the logic once and use it in both contexts.

Concerning your session issue, since you're doing that in batch mode, why not just ingest the data from Kafka into HDFS or Cassandra and then write queries for full sessions there? You could use Spark Streaming's "direct connection" to Kafka to do this.
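For concreteness, here's a minimal sketch of that ingestion path using the direct Kafka API from the spark-streaming-kafka artifact (Spark 1.x era, matching the question's timeframe); the broker address, topic name, batch interval, and output path are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-ingest")
    val ssc  = new StreamingContext(conf, Seconds(30))

    // "Direct connection": no receivers; Spark computes Kafka offset ranges per batch.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
    val topics      = Set("datapoints")                             // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Land the raw message values in HDFS; the batch layer can then query
    // full sessions there instead of re-reading the whole Kafka topic.
    stream.map { case (_, value) => value }
          .saveAsTextFiles("hdfs:///data/datapoints/batch") // placeholder path

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The same stream could instead be written to Cassandra via the spark-cassandra-connector, if that's where your batch queries will run.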

Dean Wampler answered Sep 30 '22