Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark fails with NoClassDefFoundError for org.apache.kafka.common.serialization.StringDeserializer

I am developing a generic Spark application that listens to a Kafka stream using Spark and Java.

I am using kafka_2.11-0.10.2.2, spark-2.3.2-bin-hadoop2.7 - I also tried several other kafka/spark combinations before posting this question.

The code fails at loading StringDeserializer class:

 SparkConf sparkConf = new SparkConf().setAppName("JavaDirectKafkaWordCount");
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

    Set<String> topicsSet = new HashSet<>();
    topicsSet.add(topics);
    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
    kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
    kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

The error I get is:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/StringDeserializer

From Why does Spark application fail with "Exception in thread "main" java.lang.NoClassDefFoundError: ...StringDeserializer"? it seems that this could be a scala version mismatch issue, but my pom.xml doesn't have that issue:

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>yyy.iot.ckc</groupId>
<artifactId>sparkpoc</artifactId>
<version>1.0-SNAPSHOT</version>

<name>sparkpoc</name>
<!-- FIXME change it to the project's website -->
<url>http://www.example.com</url>

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <java.version>1.8</java.version>

    <spark.scala.version>2.11</spark.scala.version>
    <spark.version>2.3.2</spark.version>
</properties>

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${spark.scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${spark.scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_${spark.scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>

</dependencies>

<build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
        <plugins>
            <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
            <plugin>
                <artifactId>maven-clean-plugin</artifactId>
                <version>3.1.0</version>
            </plugin>
            <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
            <plugin>
                <artifactId>maven-resources-plugin</artifactId>
                <version>3.0.2</version>
            </plugin>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.0</version>
            </plugin>
            <plugin>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.22.1</version>
            </plugin>
            <plugin>
                <artifactId>maven-jar-plugin</artifactId>
                <version>3.0.2</version>
            </plugin>
            <plugin>
                <artifactId>maven-install-plugin</artifactId>
                <version>2.5.2</version>
            </plugin>
            <plugin>
                <artifactId>maven-deploy-plugin</artifactId>
                <version>2.8.2</version>
            </plugin>
            <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
            <plugin>
                <artifactId>maven-site-plugin</artifactId>
                <version>3.7.1</version>
            </plugin>
            <plugin>
                <artifactId>maven-project-info-reports-plugin</artifactId>
                <version>3.0.0</version>
            </plugin>
        </plugins>
    </pluginManagement>
</build>
</project>

The submission script I use is:

./bin/spark-submit \
    --class "yyy.iot.ckc.KafkaDataModeler" \
    --master local[2] \
    ../sparkpoc/target/sparkpoc-1.0-SNAPSHOT.jar

Can anyone please point me in the right direction as to where I am going wrong?

like image 958
tinkerbeast Avatar asked Jan 21 '26 13:01

tinkerbeast


2 Answers

Spark runs the program as by running an instance of a JVM. So if the libraries (JARs) are not in the classpath of that JVM we run into this runtime exception. The solution is to package all the dependent JARs along with main JAR. The following build script will work for that.

Also, as mentioned in https://stackoverflow.com/a/54583941/1224075 the scope of the spark-core and spark-streaming libraries need to be declared as provided. This is because some of the libraries are implicitly provided by the Spark JVM.

The build section of the POM which worked for me -

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.2.1</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
like image 76
tinkerbeast Avatar answered Jan 23 '26 02:01

tinkerbeast


You need to use the Maven Shade Plugin to package the Kafka clients along with your Spark application, then you can submit the shaded Jar, and the Kafka serializers should be found on the classpath.

Also, make sure you set the provided Spark packages

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${spark.scala.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_${spark.scala.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
like image 42
OneCricketeer Avatar answered Jan 23 '26 01:01

OneCricketeer