Kafka - problems with TimestampExtractor

Tags: java, apache-kafka, apache-kafka-streams

I use org.apache.kafka:kafka-streams:0.10.0.1

I'm working with a time-series-based stream, but the processor I attach via KStream.process() never has its punctuate() triggered. (see here for reference)

In a KafkaStreams config I'm passing in this param (among others):

config.put(
  StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
  EventTimeExtractor.class.getName());

Here, EventTimeExtractor is a custom timestamp extractor (that implements org.apache.kafka.streams.processor.TimestampExtractor) to extract the timestamp information from JSON data.
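For illustration only, the parsing part of such an extractor might look roughly like the sketch below. This is not the actual EventTimeExtractor from the question: the helper class JsonEventTime, the field name "timestamp", and the epoch-millisecond encoding are all assumptions. In a real extractor, logic like this would be called from extract(...) on the record value, and a proper JSON parser would be more robust than a regular expression.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Kafka Streams.
final class JsonEventTime {
    // Assumption: the payload carries a numeric field named "timestamp"
    // holding epoch milliseconds, e.g. {"id":7,"timestamp":1474038540000}.
    private static final Pattern TS_FIELD =
            Pattern.compile("\"timestamp\"\\s*:\\s*(\\d+)");

    private JsonEventTime() {}

    // Returns the first matching timestamp, or empty if none can be read.
    static OptionalLong parseEpochMillis(String json) {
        if (json == null) {
            return OptionalLong.empty();
        }
        Matcher m = TS_FIELD.matcher(json);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(m.group(1)));
        } catch (NumberFormatException e) {
            // The digit run does not fit into a long.
            return OptionalLong.empty();
        }
    }
}
```

Returning an empty value for unparseable records keeps the decision about what to do with them (skip, fall back to another timestamp, fail) in the calling code.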

I would expect this to call my object (derived from TimestampExtractor) when each new record is pulled in. The stream in question is 2 * 10^6 records / minute. I have punctuate() set to 60 seconds and it never fires. I know the data passes this span very frequently since it's pulling old values to catch up.

In fact it never gets called at all.

Is this the wrong approach to setting timestamps on KStream records?
Is this the wrong way to declare this configuration?


asked Sep 16 '16 15:09

ethrbunny

2 Answers

Update Nov 2017: Kafka Streams in Kafka 1.0 now supports punctuate() with both stream-time and with processing-time (wall clock time) behavior. So you can pick whichever behavior you prefer.

Your setup seems correct to me.

What you need to be aware of: As of Kafka 0.10.0, the punctuate() method operates on stream-time (by default, i.e. based on the default timestamp extractor, stream-time will mean event-time). And the stream-time is only advanced when new data records are coming in, and how much the stream-time is advanced is determined by the associated timestamps of these new records.

For example:

Let's assume you have set punctuate() to be called every 1 minute = 60 * 1000 (note: 1 minute of stream-time). Now, if it happens that no data is being received for the next 5 minutes, punctuate() will not be called at all -- even though you might expect it to be called 5 times. Why? Again, because punctuate() depends on stream-time, and the stream-time is only advanced based on newly received data records.
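The example above can be mimicked with a small toy model. This is plain Java, not Kafka code: the class StreamTimePunctuator and its methods are hypothetical, and the catch-up loop that fires overdue punctuations is an assumption of the model rather than a statement about Kafka's internals. It only shows the key point, namely that stream-time moves when a record arrives and not when the wall clock moves.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical): punctuation driven purely by stream-time.
final class StreamTimePunctuator {
    private final long intervalMs;
    private boolean started = false;   // becomes true on the first record
    private long streamTime;           // highest record timestamp seen so far
    private long nextPunctuate;        // stream-time of the next punctuation
    private final List<Long> fired = new ArrayList<>();

    StreamTimePunctuator(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    // Called once per record with the timestamp the extractor returned.
    void onRecord(long recordTimestampMs) {
        if (!started) {
            started = true;
            streamTime = recordTimestampMs;
            nextPunctuate = streamTime + intervalMs;
            return;
        }
        // In this model stream-time never moves backwards.
        streamTime = Math.max(streamTime, recordTimestampMs);
        while (streamTime >= nextPunctuate) {
            fired.add(nextPunctuate);
            nextPunctuate += intervalMs;
        }
    }

    // Wall-clock time passing is deliberately a no-op.
    void onWallClockTick(long wallClockMs) {
        // intentionally empty
    }

    // Stream-times at which punctuate() would have been invoked.
    List<Long> fired() {
        return fired;
    }
}
```

With an interval of 60 * 1000, feeding one record at timestamp 0 and then letting five minutes of wall-clock time pass fires nothing in this model. Only a later record whose timestamp is at least 60000 advances stream-time far enough to trigger a punctuation.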

Might this be causing the behavior you are seeing?

Looking ahead: There's already an ongoing discussion in the Kafka project on how to make punctuate() more flexible, e.g. to trigger it not only based on stream-time (which defaults to event-time) but also based on processing-time.

answered Sep 23 '22 21:09

Michael G. Noll

Your approach seems to be correct. Compare the paragraph "Timestamp Extractor (timestamp.extractor):" in http://docs.confluent.io/3.0.1/streams/developer-guide.html#optional-configuration-parameters

I'm not sure why your custom timestamp extractor is not used. Have a look at org.apache.kafka.streams.processor.internals.StreamTask. In the constructor there should be something like

TimestampExtractor timestampExtractor1 = (TimestampExtractor)config.getConfiguredInstance("timestamp.extractor", TimestampExtractor.class);

Check if your custom extractor is picked up there or not...

answered Sep 24 '22 21:09

Matthias J. Sax
