 

Kappa architecture: when insert to batch/analytic serving layer happens

As you know, the Kappa architecture is a kind of simplification of the Lambda architecture. Kappa doesn't need a batch layer; instead, the speed layer has to guarantee computation precision and enough throughput (more parallelism/resources) when recomputing historical data.

Still, the Kappa architecture requires two serving layers when you need to do analytics based on historical data. For example, data younger than 2 weeks is stored in Redis (the streaming serving layer), while all older data is stored in HBase (the batch serving layer).

When (according to the Kappa architecture) do I have to insert data into the batch serving layer? If the streaming layer inserts data immediately into both the batch and stream serving layers, then what about late-arriving data? Or should the streaming layer back up the speed serving layer to the batch serving layer on a regular basis?


Example: let's say the source of data is Kafka, the data is processed by Spark Structured Streaming or Flink, and the sinks are Redis and HBase. When should the writes to Redis and HBase happen?

asked Oct 15 '19 by VB_


People also ask

What is the Kappa architecture and how does it differ from the Lambda architecture?

The Kappa Architecture is considered a simpler alternative to the Lambda Architecture as it uses the same technology stack to handle both real-time stream processing and historical batch processing. Both architectures entail the storage of historical data to enable large-scale analytics.

What are Lambda and Kappa architecture and what are they suitable for?

The first approach is called a Lambda architecture and has two different components: batch processing and stream processing. The second approach is called a Kappa architecture, where all data in your environment is treated as a stream.

What is the serving layer in Lambda architecture?

The serving layer is the last component of the batch section of the Lambda Architecture. It's tightly tied to the batch layer because the batch layer is responsible for continually updating the serving layer views. These views will always be out of date due to the high-latency nature of batch computation.

Which of the following is the main reason to use Lambda architecture instead of micro batch streaming or batch pipelines?

This helps to reduce the latency (i.e., the wait time for making data available for analysis) that is inherent in the batch/serving layers. Data consistency. One key idea behind the Lambda Architecture is that it eliminates the risk of data inconsistency that is often seen in distributed systems.


1 Answer

If we perform stream processing, we want to make sure that the output data is first made available as a data stream. In your example, that means we write to Kafka as the primary sink.

Now you have two options:

  • have secondary jobs that read from that Kafka topic and write to Redis and HBase. That is the Kafka way: Kafka Streams does not support writing directly to any of these systems, so you set up a Kafka Connect job. These secondary jobs can then be tailored to the specific sinks, but they add additional operational overhead. (That's a bit like the backup option that you mentioned.)
  • with Spark and Flink you also have the option of adding secondary sinks directly in your job (see the sketch after this list). You may add additional processing steps to transform the Kafka output into a more suitable form for each sink, but you are more limited when configuring the job. For example, in Flink you need to use the same checkpointing settings for the Kafka sink and the Redis/HBase sink. Nevertheless, if the settings work out, you just need to run one streaming job instead of two or three.
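
Here is a minimal sketch of the second option, assuming Flink's DataStream API with the newer KafkaSource/KafkaSink connectors (Flink 1.14+). Topic names, addresses, and the empty secondary sink body are placeholders; the actual processing logic and the Redis/HBase client calls are omitted.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;

    public class KappaSingleJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // One checkpoint interval is shared by every sink attached to this job.
            env.enableCheckpointing(60_000);

            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("kafka:9092")        // placeholder address
                    .setTopics("events")                      // placeholder topic
                    .setGroupId("kappa-job")
                    .setStartingOffsets(OffsetsInitializer.earliest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            DataStream<String> results =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");
            // ... the actual stream processing (aggregations etc.) is omitted here ...

            // Primary sink: the output is first made available as a stream again.
            KafkaSink<String> kafkaSink = KafkaSink.<String>builder()
                    .setBootstrapServers("kafka:9092")
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("results")              // placeholder topic
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    .build();
            results.sinkTo(kafkaSink);

            // Secondary sink inside the same job (the second option above):
            // fewer jobs to operate, but it shares this job's checkpoint settings.
            results.addSink(new SinkFunction<String>() {
                @Override
                public void invoke(String value, Context context) {
                    // Redis client calls omitted; an HBase sink would look analogous.
                }
            });

            env.execute("kappa-single-job");
        }
    }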

Late events

Now the question is what to do with late data. The best solution is to let the framework handle it through watermarks; that is, data is only committed to the sinks once the framework is sure that no more late data will arrive. If that doesn't work out, because you really need to process late events even if they arrive much, much later and you still want temporary results, you have to use update events.
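
As a rough sketch of what watermark-based emission looks like in Flink: the Event POJO, its field names, and the 5-minute out-of-orderness bound below are assumptions for illustration only.

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WatermarkedAggregation {

        /** Hypothetical event type; public fields make it a Flink POJO. */
        public static class Event {
            public String userId;
            public double amount;
            public long timestamp; // event time in epoch millis
        }

        public static DataStream<Event> aggregate(DataStream<Event> events) {
            return events
                    // Declare how late events may be: the watermark trails the
                    // largest seen timestamp by 5 minutes.
                    .assignTimestampsAndWatermarks(
                            WatermarkStrategy
                                    .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                                    .withTimestampAssigner((event, previous) -> event.timestamp))
                    .keyBy(event -> event.userId)
                    .window(TumblingEventTimeWindows.of(Time.minutes(10)))
                    // The window result is emitted exactly once, when the watermark
                    // passes the window end; only then does it reach the sinks.
                    .sum("amount");
        }
    }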

Update events

(As requested by the OP, I will add more details on update events.)

In Kafka Streams, elements are emitted through a continuous-refinement mechanism by default. That means windowed aggregations emit a result as soon as they have any valid data point and update that result as new data arrives. Thus, any late event is processed and yields an updated result. While this approach nicely lowers the burden on users, as they do not need to understand watermarks, it has some severe shortcomings that led the Kafka Streams developers to add Suppression in 2.1 and onward.
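
A minimal Kafka Streams sketch of the difference, assuming default serdes are configured and using placeholder topic names; the only change between the two behaviours is the suppress() call (available since 2.1).

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.Suppressed;
    import org.apache.kafka.streams.kstream.Suppressed.BufferConfig;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class SuppressedCounts {

        public static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();

            builder.<String, String>stream("events")          // placeholder topic, default serdes
                    .groupByKey()
                    // grace() bounds how late an event may still update its window.
                    .windowedBy(TimeWindows.of(Duration.ofMinutes(10)).grace(Duration.ofMinutes(5)))
                    .count()
                    // Without suppress(): continuous refinement, i.e. one (intermediate)
                    // update per incoming record. With it (2.1+): each window is emitted
                    // exactly once, after its grace period has passed.
                    .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
                    .toStream((windowedKey, count) -> windowedKey.key())
                    .to("counts", Produced.with(Serdes.String(), Serdes.Long()));

            return builder.build();
        }
    }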

The main issue with continuous refinement is that it makes it quite challenging for downstream users to process intermediate results, as also explained in the article about Suppression. If it's not obvious whether a result is temporary or "final" (in the sense that all expected events have been processed), many applications are much harder to implement. In particular, windowing operations need to be replicated on the consumer side to obtain the "final" value.

Another issue is that the data volume is blown up. With a strong aggregation factor, watermark-based emission reduces your data volume heavily after the first operation. Continuous refinement, however, adds a constant volume factor, as each record triggers a new (intermediate) record for all intermediate steps.

Lastly, and particularly interesting for you, is how to offload data to external systems if you have update events. Ideally, you would offload the data continuously or periodically with some time lag. That approach simulates watermark-based emission again on the consumer side.
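
One hedged sketch of such a periodic offload, using a plain Kafka consumer: keep only the newest update per key and flush on a fixed interval, so most intermediate updates never reach the batch serving layer. The topic name, the interval, and the writeToHBase helper are hypothetical, and offset/commit handling is simplified for brevity.

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class PeriodicOffloader {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");        // placeholder address
            props.put("group.id", "hbase-offloader");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            long flushIntervalMillis = Duration.ofMinutes(10).toMillis(); // tune to expected lateness
            Map<String, String> latestByKey = new HashMap<>();
            long lastFlush = System.currentTimeMillis();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("results"));          // placeholder update-event topic

                while (true) {
                    // Keep only the newest update per key; earlier intermediate results
                    // for the same key are dropped before they ever reach HBase.
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        latestByKey.put(record.key(), record.value());
                    }

                    // Periodic flush: roughly simulates watermark-based emission on the
                    // consumer side, at the cost of up to flushIntervalMillis extra latency.
                    if (System.currentTimeMillis() - lastFlush >= flushIntervalMillis) {
                        latestByKey.forEach(PeriodicOffloader::writeToHBase);
                        latestByKey.clear();
                        lastFlush = System.currentTimeMillis();
                    }
                }
            }
        }

        /** Hypothetical helper; a real implementation would upsert via the HBase client. */
        private static void writeToHBase(String key, String value) {
            // HBase client calls omitted.
        }
    }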

Mixing the options

It's possible to use watermarks for the initial emission and then use update events for late events. The volume is then reduced for all "on-time" events. For example, Flink offers allowed lateness to make windows trigger again for late events.
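
Building on the hypothetical Event stream from the watermark sketch above, the only change in Flink is the allowedLateness() call on the window definition.

    // Same pipeline as the watermark sketch above (hypothetical Event POJO), with one change:
    events.assignTimestampsAndWatermarks(
                    WatermarkStrategy
                            .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                            .withTimestampAssigner((event, previous) -> event.timestamp))
            .keyBy(event -> event.userId)
            .window(TumblingEventTimeWindows.of(Time.minutes(10)))
            // The on-time result is still emitted when the watermark passes the window end...
            .allowedLateness(Time.hours(1))
            // ...but for one more hour each late event re-fires the window and emits an
            // updated result, which downstream sinks have to treat as an upsert.
            .sum("amount");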

This setup makes offloading data much easier, as data only needs to be re-emitted to the external systems if a late event actually happened. The system should be tuned so that late events are a rare case, though.

answered Oct 08 '22 by Arvid Heise