I have a Spark Streaming app written in Java, using Spark 2.1. I am using KafkaUtils.createDirectStream to read messages from Kafka. I am using a kryo encoder/decoder for the Kafka messages, specified through the Kafka properties key.serializer, value.serializer, key.deserializer and value.deserializer.
When Spark pulls the messages in a micro-batch, they are successfully decoded with the kryo decoder. However, I noticed that the Spark executor creates a new instance of the kryo decoder for every message it reads from Kafka. I checked this by putting logging inside the decoder constructor.
This seems weird to me. Shouldn't the same decoder instance be reused for every message and every batch?
Code where I am reading from Kafka:
JavaInputDStream<ConsumerRecord<String, Class1>> consumerRecords = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, Class1>Subscribe(topics, kafkaParams));

JavaPairDStream<String, Class1> converted = consumerRecords.mapToPair(consRecord -> {
    return new Tuple2<String, Class1>(consRecord.key(), consRecord.value());
});
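For reference, kafkaParams is built roughly like this (Class1KryoDeserializer stands in for my actual kryo-backed implementation of org.apache.kafka.common.serialization.Deserializer, and the broker address and group id are just example values):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.common.serialization.StringDeserializer;

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092");              // example broker
    kafkaParams.put("group.id", "my-streaming-app");                     // example consumer group
    kafkaParams.put("key.deserializer", StringDeserializer.class);       // keys are plain strings
    kafkaParams.put("value.deserializer", Class1KryoDeserializer.class); // my kryo-backed deserializer
    kafkaParams.put("auto.offset.reset", "latest");
    kafkaParams.put("enable.auto.commit", false);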
Kafka (with Kafka Streams) processes events as they arrive, i.e. it uses a continuous, event-at-a-time processing model. Spark, on the other hand, uses a micro-batch approach: it divides the incoming stream into small batches and processes each batch as a unit.
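To make the micro-batch side concrete: in Spark Streaming the batch size is fixed when the streaming context is created, and each interval's worth of records fetched from Kafka becomes one RDD. A minimal sketch (the application name and the 5-second interval are arbitrary examples):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Every 5 seconds, the records received in that window are turned into one micro-batch (one RDD).
    SparkConf conf = new SparkConf().setAppName("kafka-kryo-demo");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));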
Kafka stores records for topics on disk and retains that data even after consumers have read it. The records are not kept in one big file, though: each topic partition is broken up into segments, and the offset order stays continuous across the segments of a given partition.
Kafka is primarily used to build real-time streaming data pipelines and applications that react to those streams. It combines messaging, storage, and stream processing, which allows both historical and real-time data to be stored and analysed.
Spark Streaming is an API that can be connected to a variety of sources, including Kafka, and provides the scalability, throughput, and fault tolerance needed for processing live data streams reliably.
If we want to see how Spark fetches data from Kafka internally, we'll need to look at KafkaRDD.compute, which is a method implemented for every RDD that tells the framework how to, well, compute that RDD:
override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
  val part = thePart.asInstanceOf[KafkaRDDPartition]
  assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
  if (part.fromOffset == part.untilOffset) {
    logInfo(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
      s"skipping ${part.topic} ${part.partition}")
    Iterator.empty
  } else {
    new KafkaRDDIterator(part, context)
  }
}
What's important here is the else clause, which creates a KafkaRDDIterator. This internally has:
val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
  .newInstance(kc.config.props)
  .asInstanceOf[Decoder[K]]

val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
  .newInstance(kc.config.props)
  .asInstanceOf[Decoder[V]]
As you can see, this creates an instance of both the key decoder and the value decoder via reflection, for each underlying partition. In other words, the decoders aren't created per message but per Kafka partition.
Why is it implemented this way? I don't know. I'm assuming it's because constructing a key and value decoder should have a negligible performance cost compared to all the other allocations happening inside Spark.
If you've profiled your app and found this to be an allocation hot path, you could open an issue. Otherwise, I wouldn't worry about it.
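If it ever does show up in a profile, one mitigation on the application side (a sketch only, not something Spark does for you; Class1KryoDeserializer is the hypothetical value deserializer name from the question) is to keep the expensive Kryo object in a ThreadLocal, so repeated construction of the deserializer itself stays cheap:

    import java.util.Map;

    import org.apache.kafka.common.serialization.Deserializer;

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;

    public class Class1KryoDeserializer implements Deserializer<Class1> {

        // Kryo instances are not thread-safe, so keep one per thread and reuse it
        // no matter how many deserializer instances get constructed.
        private static final ThreadLocal<Kryo> KRYO = ThreadLocal.withInitial(() -> {
            Kryo kryo = new Kryo();
            kryo.register(Class1.class); // register nested types too if your Kryo version requires it
            return kryo;
        });

        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            // no per-instance state to configure
        }

        @Override
        public Class1 deserialize(String topic, byte[] data) {
            if (data == null) {
                return null;
            }
            try (Input input = new Input(data)) {
                return KRYO.get().readObject(input, Class1.class);
            }
        }

        @Override
        public void close() {
            // nothing to release; the per-thread Kryo lives for the executor thread's lifetime
        }
    }

With this shape it doesn't matter much how often the deserializer object itself is created, because the costly part (the Kryo instance and its registrations) survives across constructions on the same thread.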