 

Kappa architecture: when insert to batch/analytic serving layer happens

As you know, the Kappa architecture is a kind of simplification of the Lambda architecture. Kappa doesn't need a batch layer; instead, the speed layer has to guarantee computation precision and enough throughput (more parallelism/resources) when recomputing historical data.

Still, the Kappa architecture requires two serving layers when you need to do analytics based on historical data. For example, data younger than 2 weeks is stored in Redis (the streaming serving layer), while all older data is stored in HBase (the batch serving layer).

When (according to the Kappa architecture) do I have to insert data into the batch serving layer? If the streaming layer inserts data immediately into both the batch and stream serving layers, then what about late-arriving data? Or should the streaming layer back up the speed serving layer to the batch serving layer on a regular basis?


Example: let's say the source of data is Kafka, the data is processed by Spark Structured Streaming or Flink, and the sinks are Redis and HBase. When should the writes to Redis and HBase happen?

asked Oct 15 '19 by VB_


People also ask

What is the Kappa architecture and how does it differ from the Lambda architecture?

The Kappa Architecture is considered a simpler alternative to the Lambda Architecture as it uses the same technology stack to handle both real-time stream processing and historical batch processing. Both architectures entail the storage of historical data to enable large-scale analytics.

What are Lambda and Kappa architecture and what are they suitable for?

The first approach is called a Lambda architecture and has two different components: batch processing and stream processing. The second approach is called a Kappa architecture, where all data in your environment is treated as a stream.

What is the serving layer in Lambda architecture?

The serving layer is the last component of the batch section of the Lambda Architecture. It's tightly tied to the batch layer because the batch layer is responsible for continually updating the serving layer views. These views will always be out of date due to the high-latency nature of batch computation.

Which of the following is the main reason to use Lambda architecture instead of micro batch streaming or batch pipelines?

This helps to reduce the latency (i.e., the wait time for making data available for analysis) that is inherent in the batch/serving layers. Data consistency. One key idea behind the Lambda Architecture is that it eliminates the risk of data inconsistency that is often seen in distributed systems.


1 Answer

If we perform stream processing, we want to make sure that the output data is first made available as a data stream. In your example, that means we write to Kafka as the primary sink.

Now you have two options:

  • have secondary jobs that read from that Kafka topic and write to Redis and HBase. That is the Kafka way: Kafka Streams does not support writing directly to any of these systems, so you set up a Kafka Connect job. These secondary jobs can then be tailored to the specific sinks, but they add additional operational overhead. (That's a bit like the backup option that you mentioned.)
  • with Spark and Flink you also have the option of adding secondary sinks directly in your job (see the sketch after this list). You may add additional processing steps to transform the Kafka output into a more suitable form for each sink, but you are more limited when configuring the job. For example, in Flink you need to use the same checkpointing settings for the Kafka sink and the Redis/HBase sink. Nevertheless, if the settings work out, you just need to run one streaming job instead of two or three.
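
Here is a minimal sketch of the second option, assuming Flink's DataStream API with the newer KafkaSource/KafkaSink connectors (Flink 1.14+). Topic names, addresses, and the empty secondary sink body are placeholders; the actual processing logic and the Redis/HBase client calls are omitted.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;

    public class KappaSingleJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // One checkpoint interval is shared by every sink attached to this job.
            env.enableCheckpointing(60_000);

            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("kafka:9092")        // placeholder address
                    .setTopics("events")                      // placeholder topic
                    .setGroupId("kappa-job")
                    .setStartingOffsets(OffsetsInitializer.earliest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            DataStream<String> results =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");
            // ... the actual stream processing (aggregations etc.) is omitted here ...

            // Primary sink: the output is first made available as a stream again.
            KafkaSink<String> kafkaSink = KafkaSink.<String>builder()
                    .setBootstrapServers("kafka:9092")
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("results")              // placeholder topic
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    .build();
            results.sinkTo(kafkaSink);

            // Secondary sink inside the same job (the second option above):
            // fewer jobs to operate, but it shares this job's checkpoint settings.
            results.addSink(new SinkFunction<String>() {
                @Override
                public void invoke(String value, Context context) {
                    // Redis client calls omitted; an HBase sink would look analogous.
                }
            });

            env.execute("kappa-single-job");
        }
    }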

Late events

Now the question is what to do with late data. The best solution is to let the framework handle it through watermarks; that is, data is only committed to the sinks once the framework is sure that no more late data will arrive. If that doesn't work out, because you really need to process late events even if they arrive much, much later and you still want temporary results, you have to use update events.
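
As a rough sketch of what watermark-based emission looks like in Flink: the Event POJO, its field names, and the 5-minute out-of-orderness bound below are assumptions for illustration only.

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WatermarkedAggregation {

        /** Hypothetical event type; public fields make it a Flink POJO. */
        public static class Event {
            public String userId;
            public double amount;
            public long timestamp; // event time in epoch millis
        }

        public static DataStream<Event> aggregate(DataStream<Event> events) {
            return events
                    // Declare how late events may be: the watermark trails the
                    // largest seen timestamp by 5 minutes.
                    .assignTimestampsAndWatermarks(
                            WatermarkStrategy
                                    .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                                    .withTimestampAssigner((event, previous) -> event.timestamp))
                    .keyBy(event -> event.userId)
                    .window(TumblingEventTimeWindows.of(Time.minutes(10)))
                    // The window result is emitted exactly once, when the watermark
                    // passes the window end; only then does it reach the sinks.
                    .sum("amount");
        }
    }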

Update events

(As requested by the OP, I will add more details on update events.)

In Kafka Streams, elements are emitted through a continuous-refinement mechanism by default. That means windowed aggregations emit a result as soon as they have any valid data point and update that result as new data arrives. Thus, any late event is processed and yields an updated result. While this approach nicely lowers the burden on users, as they do not need to understand watermarks, it has some severe shortcomings that led the Kafka Streams developers to add Suppression in 2.1 and onward.
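
A minimal Kafka Streams sketch of the difference, assuming default serdes are configured and using placeholder topic names; the only change between the two behaviours is the suppress() call (available since 2.1).

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.Suppressed;
    import org.apache.kafka.streams.kstream.Suppressed.BufferConfig;
    import org.apache.kafka.streams.kstream.TimeWindows;

    public class SuppressedCounts {

        public static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();

            builder.<String, String>stream("events")          // placeholder topic, default serdes
                    .groupByKey()
                    // grace() bounds how late an event may still update its window.
                    .windowedBy(TimeWindows.of(Duration.ofMinutes(10)).grace(Duration.ofMinutes(5)))
                    .count()
                    // Without suppress(): continuous refinement, i.e. one (intermediate)
                    // update per incoming record. With it (2.1+): each window is emitted
                    // exactly once, after its grace period has passed.
                    .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()))
                    .toStream((windowedKey, count) -> windowedKey.key())
                    .to("counts", Produced.with(Serdes.String(), Serdes.Long()));

            return builder.build();
        }
    }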

The main issue with continuous refinement is that it makes it quite challenging for downstream users to process intermediate results, as also explained in the article about Suppression. If it's not obvious whether a result is temporary or "final" (in the sense that all expected events have been processed), many applications are much harder to implement. In particular, windowing operations need to be replicated on the consumer side to obtain the "final" value.

Another issue is that the data volume is blown up. With a strong aggregation factor, watermark-based emission reduces your data volume heavily after the first operation. Continuous refinement, however, adds a constant volume factor, as each record triggers a new (intermediate) record for all intermediate steps.

Lastly, and particularly interesting for you, is how to offload data to external systems if you have update events. Ideally, you would offload the data continuously or periodically with some time lag. That approach simulates watermark-based emission again on the consumer side.
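
One hedged sketch of such a periodic offload, using a plain Kafka consumer: keep only the newest update per key and flush on a fixed interval, so most intermediate updates never reach the batch serving layer. The topic name, the interval, and the writeToHBase helper are hypothetical, and offset/commit handling is simplified for brevity.

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class PeriodicOffloader {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka:9092");        // placeholder address
            props.put("group.id", "hbase-offloader");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            long flushIntervalMillis = Duration.ofMinutes(10).toMillis(); // tune to expected lateness
            Map<String, String> latestByKey = new HashMap<>();
            long lastFlush = System.currentTimeMillis();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("results"));          // placeholder update-event topic

                while (true) {
                    // Keep only the newest update per key; earlier intermediate results
                    // for the same key are dropped before they ever reach HBase.
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        latestByKey.put(record.key(), record.value());
                    }

                    // Periodic flush: roughly simulates watermark-based emission on the
                    // consumer side, at the cost of up to flushIntervalMillis extra latency.
                    if (System.currentTimeMillis() - lastFlush >= flushIntervalMillis) {
                        latestByKey.forEach(PeriodicOffloader::writeToHBase);
                        latestByKey.clear();
                        lastFlush = System.currentTimeMillis();
                    }
                }
            }
        }

        /** Hypothetical helper; a real implementation would upsert via the HBase client. */
        private static void writeToHBase(String key, String value) {
            // HBase client calls omitted.
        }
    }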

Mixing the options

It's possible to use watermarks for the initial emission and then use update events for late events. The volume is then reduced for all "on-time" events. For example, Flink offers allowed lateness to make windows trigger again for late events.
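
Building on the hypothetical Event stream from the watermark sketch above, the only change in Flink is the allowedLateness() call on the window definition.

    // Same pipeline as the watermark sketch above (hypothetical Event POJO), with one change:
    events.assignTimestampsAndWatermarks(
                    WatermarkStrategy
                            .<Event>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                            .withTimestampAssigner((event, previous) -> event.timestamp))
            .keyBy(event -> event.userId)
            .window(TumblingEventTimeWindows.of(Time.minutes(10)))
            // The on-time result is still emitted when the watermark passes the window end...
            .allowedLateness(Time.hours(1))
            // ...but for one more hour each late event re-fires the window and emits an
            // updated result, which downstream sinks have to treat as an upsert.
            .sum("amount");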

This setup makes offloading data much easier, as data only needs to be re-emitted to the external systems if a late event actually happened. The system should be tuned so that late events are a rare case, though.

answered Oct 08 '22 by Arvid Heise