I'm developing a Java application using Apache Spark. I use this version:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.2</version>
</dependency>
In my code, there is a transitive dependency:
<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.5.2</version>
</dependency>
I package my application into a single JAR file. When I deploy it on an EC2 instance using spark-submit, I get this error:
Caused by: java.lang.NoSuchFieldError: INSTANCE
at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:144)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:87)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:65)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:58)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:50)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
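For context, it is the AWS SDK that initializes this Apache HTTP client; in my application the path above is reached by ordinary SDK client construction, roughly along these lines (a simplified, hypothetical sketch, the real code and the actual service client differ):
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class AwsClientSketch {
    public static void main(String[] args) {
        // Hypothetical example: building any AWS SDK 1.11.x service client goes
        // through ApacheHttpClientFactory, which initializes SSLConnectionSocketFactory
        // and therefore needs httpclient 4.5.x on the classpath.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")
                .build();
        System.out.println("Buckets visible: " + s3.listBuckets().size());
    }
}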
This error clearly shows that SparkSubmit has loaded an older version of the same Apache httpclient library, and the conflict happens for this reason.
What is a good way to solve this issue?
For some reason, I cannot upgrade the Spark version in my Java code. However, I could easily do that on the EC2 cluster. Is it possible to deploy my Java code on a cluster running a higher version, say 1.6.1?
As you said in your post, Spark is loading an older version of httpclient. The solution is to use the Maven Shade plugin's relocation facility to produce a neat, conflict-free project.
Here's an example of how to use it in your pom.xml file:
<project>
  <!-- Your project definition here, with the groupId, artifactId, and its dependencies -->
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <relocations>
                <relocation>
                  <pattern>org.apache.http</pattern>
                  <shadedPattern>shaded.org.apache.http</shadedPattern>
                </relocation>
              </relocations>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
This relocates every class under org.apache.http (which includes org.apache.http.conn.ssl, where the failing class lives) to shaded.org.apache.http inside your fat JAR and rewrites the references in the classes packaged with it, so your code uses the relocated 4.5.2 classes while Spark keeps its own older copy, resolving the conflict.
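To double-check which copy of HttpClient is actually picked up at runtime, you can print where the conflicting class is loaded from. This is just a throwaway diagnostic sketch (the class name is simply the one from your stack trace); submit it the same way you submit your job and look at the printed URL:
public class HttpClientOriginCheck {
    public static void main(String[] args) {
        // Ask the classloader where the class from the stack trace comes from;
        // the printed URL shows which JAR (Spark's old httpclient or the one
        // you bundled) is actually being used.
        String resource = "org/apache/http/conn/ssl/SSLConnectionSocketFactory.class";
        java.net.URL location = HttpClientOriginCheck.class
                .getClassLoader()
                .getResource(resource);
        System.out.println("SSLConnectionSocketFactory loaded from: " + location);
    }
}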
Original post:
If this is simply a matter of transitive dependencies, you could just add this to your spark-core dependency to exclude the HttpClient used by Spark:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.2</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
    </exclusion>
  </exclusions>
</dependency>
I also set the scope to provided in your dependency, since Spark will be provided by your cluster.
However, that might muck around with Spark's internal behaviour. If you still get an error after doing this, try the Maven Shade relocation facility shown above, which should produce a neat, conflict-free project.
Regarding the fact that you can't upgrade Spark's version: did you use exactly this dependency declaration from mvnrepository?
Spark is backwards compatible, so there shouldn't be any problem deploying your job on a cluster running a higher version.