Kafka stream join

Tags: java, join, java-8, apache-kafka-streams

I have 2 Kafka topics: recommendations and clicks. The first topic holds recommendations objects keyed by a unique ID (called recommendationsId). Each product has a URL that the user can click.

The clicks topic gets the messages generated by clicks on those product URLs recommended to the user. It has been so set up that these click messages are also keyed by the recommendationId.

Note that

1. The relationship between recommendations and clicks is one-to-many. A recommendation may lead to multiple clicks, but a click is always associated with a single recommendation.
2. Each click object has a corresponding recommendations object.
3. A click object has a later timestamp than its recommendations object.
4. The gap between a recommendation and the corresponding click(s) can be anywhere from a few seconds to a few days (say, 7 days at the most).

My goal is to join these two topics using Kafka streams join. What I am not clear about is whether I should use a KStream x KStream join or a KStream x KTable join.

I implemented the KStream x KTable join by joining clicks stream by recommendations table. However, I am not able to see any joined clicks-recommendations pair if the recommendations were generated before the joiner was started and the click arrives after the joiner started.

Am I using the right join? Should I be using KStream x KStream join? If so, in order to be able to join a click with a recommendation at most 7 days in the past, should I set the window size to 7 days? Do I also need to set the "retention" period in this case?

My code to perform KStream x KTable join is as follows. Note that I have defined classes Recommendations and Click and their corresponding serde. The click messages are just plain String (url). This URL String is joined with Recommendations object to create a Click object which is emitted to the jointTopic.

public static void main(String[] args) {
    // Four arguments are required (the original message said 3).
    if (args.length != 4) {
      throw new RuntimeException("Expected 4 params: bootstrapList clicksTopic recsTopic jointTopic");
    }

    final String bootstrapList = args[0];
    final String clicksTopic = args[1];
    final String recsTopic = args[2];
    final String jointTopic = args[3];

    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "my_joiner_id");
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapList);
    config.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    config.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, JoinSerdes.CLICK_SERDE.getClass().getName());

    KStreamBuilder builder = new KStreamBuilder();

    // load clicks as KStream (key: recommendationId, value: clicked URL)
    KStream<String, String> clicksStream = builder.stream(Serdes.String(), Serdes.String(), clicksTopic);

    // load recommendations as KTable (key: recommendationId)
    KTable<String, Recommendations> recsTable = builder.table(Serdes.String(), JoinSerdes.RECS_SERDE, recsTopic);

    // left join: recs is null when the table has no entry for the click's key
    KStream<String, Click> join = clicksStream.leftJoin(recsTable, (click, recs) -> new Click(click, recs));

    // emit the join to the jointTopic
    join.to(Serdes.String(), JoinSerdes.CLICK_SERDE, jointTopic);

    // let the action begin
    KafkaStreams streams = new KafkaStreams(builder, config);
    streams.start();
}

This works fine as long as both recommendations and clicks have been generated after the joiner (the above program) is run. If, however, a click arrives for which the recommendation was generated before the joiner was run, I don't see any join happening. How do I fix this?

If the solution is to use a KStream x KStream join, then please help me understand what window size and what retention period I should select.

asked Sep 25 '17 19:09

Nik

1 Answer

Your overall observation is correct. Conceptually, you can get the correct result both ways. If you use a stream-table join, you have two disadvantages (this might be revisited and improved in a future release of Kafka, though):

- You mentioned already that if a click gets processed before the corresponding recommendation, the (inner-)join will fail. However, as you know that there will be a recommendation, you could use a left-join instead of an inner-join, check the join result, and write the click event back to the input topic if the recommendation was null (i.e., you get retry logic). Of course, consecutive clicks for a single recommendation might get out of order, and you might need to account for this in your application code.
- A second disadvantage of the KTable is that it will grow unbounded over time, as you keep adding unique recommendations to it. Thus, you will need to implement some "expiration logic" by sending tombstone records of the form <recommendationsId, null> to the recommendation topic to delete old recommendations you don't care about any longer.
- The advantage of this approach is that you will need less memory/disk space in total, compared to a stream-stream join, because you only need to buffer all recommendations in your application (but no clicks).
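
The left-join-with-retry idea and the tombstone-based expiration described above can be modeled roughly as follows. This is only a plain-Java sketch of the logic, not Kafka Streams API code: the class `LeftJoinRetrySketch`, its methods, and the `url|recommendation` output format are made up for illustration, a `HashMap` stands in for the KTable, and an in-memory queue stands in for the clicks topic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a stream-table left join with retry (hypothetical names).
class LeftJoinRetrySketch {
    // Stands in for the KTable: latest recommendation per recommendation id.
    final Map<String, String> recsTable = new HashMap<>();
    // Stands in for the clicks input topic; unmatched clicks are appended again.
    final Deque<String[]> clicksTopic = new ArrayDeque<>();
    // Joined output, formatted as "url|recommendation".
    final List<String> joined = new ArrayList<>();

    // Table update; a null value models a tombstone and deletes the key.
    void onRecommendation(String recId, String recs) {
        if (recs == null) {
            recsTable.remove(recId);
        } else {
            recsTable.put(recId, recs);
        }
    }

    // Appends a click (keyed by recommendation id) to the simulated topic.
    void onClick(String recId, String url) {
        clicksTopic.addLast(new String[] {recId, url});
    }

    // Processes every click queued right now exactly once.
    // Returns how many clicks found no recommendation and were re-queued.
    int processClicksOnce() {
        int retried = 0;
        int pending = clicksTopic.size();
        for (int i = 0; i < pending; i++) {
            String[] click = clicksTopic.pollFirst();
            String recs = recsTable.get(click[0]); // left join: may be null
            if (recs == null) {
                clicksTopic.addLast(click);        // write back for a later retry
                retried++;
            } else {
                joined.add(click[1] + "|" + recs);
            }
        }
        return retried;
    }

    public static void main(String[] args) {
        LeftJoinRetrySketch joiner = new LeftJoinRetrySketch();
        joiner.onClick("r1", "http://example.com/p1");      // click processed first
        System.out.println("retried: " + joiner.processClicksOnce());
        joiner.onRecommendation("r1", "recs-for-r1");       // recommendation shows up
        System.out.println("retried: " + joiner.processClicksOnce());
        System.out.println(joiner.joined);
        joiner.onRecommendation("r1", null);                // tombstone expires the entry
        System.out.println("table size: " + joiner.recsTable.size());
    }
}
```

Note that a re-queued click lands behind any newer clicks for the same key, which is the reordering caveat mentioned above.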

If you use a stream-stream join, and a click can happen 7 days after a recommendation, your window size must be 7 days -- otherwise, the click would not join with the recommendation.

- The disadvantage of this approach is that you will need much more memory/disk, as you will buffer all clicks and all recommendations of the last 7 days in your application.
- The advantage is that the order of processing (i.e., recommendation vs. click) does not matter anymore (i.e., you don't need to implement the retry strategy described above).
- Furthermore, old recommendations will expire automatically, and thus you don't need to implement special "expiration logic".
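
To make the "order of processing does not matter" point concrete, here is a small plain-Java model of a windowed join that buffers both sides. It is a sketch under assumptions, not Kafka Streams code: the class `WindowedJoinSketch` and its members are hypothetical, timestamps are plain millisecond values, the window only looks forward from the recommendation (because a click is always later than its recommendation), and the buffers are never purged here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Plain-Java model of a windowed stream-stream join (hypothetical names).
class WindowedJoinSketch {
    static final class Event {
        final String key;
        final String value;
        final long tsMs;

        Event(String key, String value, long tsMs) {
            this.key = key;
            this.value = value;
            this.tsMs = tsMs;
        }
    }

    final long windowMs;
    final List<Event> recsBuffer = new ArrayList<>();   // buffered recommendations
    final List<Event> clicksBuffer = new ArrayList<>(); // buffered clicks
    final List<String> joined = new ArrayList<>();      // "url|recommendation"

    WindowedJoinSketch(long windowMs) {
        this.windowMs = windowMs;
    }

    // A click joins if it happened at most windowMs after the recommendation.
    boolean inWindow(long recTsMs, long clickTsMs) {
        long diff = clickTsMs - recTsMs;
        return diff >= 0 && diff <= windowMs;
    }

    // Buffer the recommendation and join it with clicks that were processed earlier.
    void onRecommendation(String key, String recs, long tsMs) {
        recsBuffer.add(new Event(key, recs, tsMs));
        for (Event click : clicksBuffer) {
            if (click.key.equals(key) && inWindow(tsMs, click.tsMs)) {
                joined.add(click.value + "|" + recs);
            }
        }
    }

    // Buffer the click and join it with recommendations that were processed earlier.
    void onClick(String key, String url, long tsMs) {
        clicksBuffer.add(new Event(key, url, tsMs));
        for (Event rec : recsBuffer) {
            if (rec.key.equals(key) && inWindow(rec.tsMs, tsMs)) {
                joined.add(url + "|" + rec.value);
            }
        }
    }

    public static void main(String[] args) {
        long day = TimeUnit.DAYS.toMillis(1);
        WindowedJoinSketch joiner = new WindowedJoinSketch(7 * day);
        joiner.onClick("r1", "url-a", 2 * day);    // processed before its recommendation
        joiner.onRecommendation("r1", "recs", 0L); // joins with the buffered click
        joiner.onClick("r1", "url-b", 8 * day);    // 8 days later: outside the window
        System.out.println(joiner.joined);         // prints [url-a|recs]
    }
}
```

Because both sides are buffered, the click that was processed before its recommendation still joins without any retry logic, at the cost of keeping clicks in the buffer as well.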

For a stream-stream join, the answer about retention time is a little different. It must be at least 7 days, as the window size is 7 days. Otherwise, you would delete records of your "running window". You can also set the retention period longer, to be able to process "late data". Assume a user clicks at the end of the window timeframe (5 minutes before the 7-day time span of the recommendation ends), but the click is only reported to your application 1 hour later. If your retention period is 7 days, the same as your window size, this late-arriving record cannot be processed anymore (as the recommendation would have been deleted already). If you set a larger retention period of, e.g., 8 days, you can still process late records. Which retention time to use depends on your application's semantic needs.
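
The late-click scenario above works out as follows. This is a deliberately simplified plain-Java model: the class `RetentionSketch` and the method `stillRetained` are hypothetical, and the rule "a buffered record is available until stream time has advanced more than the retention period past its timestamp" is an assumption made for the arithmetic; the actual purge mechanics inside Kafka Streams may differ in detail.

```java
import java.time.Duration;

// Simplified model of window-store retention (hypothetical names).
class RetentionSketch {
    // True if a record with timestamp recordTsMs is still buffered when
    // stream time has reached streamTimeMs.
    static boolean stillRetained(long recordTsMs, long streamTimeMs, Duration retention) {
        return streamTimeMs - recordTsMs <= retention.toMillis();
    }

    public static void main(String[] args) {
        long recTs = 0L; // recommendation timestamp
        // The user clicks 5 minutes before the 7-day window closes.
        long clickEventTs = Duration.ofDays(7).minusMinutes(5).toMillis();
        // The click is reported 1 hour late, so stream time is already at 7 days + 55 minutes.
        long streamTimeAtArrival = clickEventTs + Duration.ofHours(1).toMillis();

        // Retention equal to the window size: the recommendation is already gone.
        System.out.println(stillRetained(recTs, streamTimeAtArrival, Duration.ofDays(7))); // prints false
        // One extra day of retention: the late click can still be joined.
        System.out.println(stillRetained(recTs, streamTimeAtArrival, Duration.ofDays(8))); // prints true
    }
}
```

The click itself is inside the 7-day join window by event time in both cases; only the extra retention decides whether the recommendation is still around when the late record shows up.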

Summary: From an implementation point of view, using a stream-stream join is simpler than using a stream-table join. However, the stream-table join is expected to save memory/disk, and the savings could be large depending on your click stream data rate.

answered Sep 26 '22 05:09

Matthias J. Sax
