How to configure Spark 2.4 correctly with user-provided Hadoop

Question

I'd like to use Spark 2.4.5 (the current stable Spark version) and Hadoop 2.10 (the current stable Hadoop version in the 2.x series). I also need to access HDFS, Hive, S3, and Kafka.

http://spark.apache.org provides Spark 2.4.5 pre-built and bundled with either Hadoop 2.6 or Hadoop 2.7. Another option is the Spark build with user-provided Hadoop, so I tried that one.

A consequence of using the build with user-provided Hadoop is that Spark does not include the Hive libraries either. This produces an error like the one described here: How to create SparkSession with Hive support (fails with "Hive classes are not found")?

When I add the spark-hive dependency to the spark-shell (spark-submit is affected as well) by using

spark.jars.packages=org.apache.spark:spark-hive_2.11:2.4.5

in spark-defaults.conf, I get this error:

20/02/26 11:20:45 ERROR spark.SparkContext: 
Failed to add file:/root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar to Spark environment
java.io.FileNotFoundException: Jar /root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar not found
at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1838)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1868)
at org.apache.spark.SparkContext.$anonfun$new$11(SparkContext.scala:458)
at org.apache.spark.SparkContext.$anonfun$new$11$adapted(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)

because spark-shell cannot handle classifiers together with bundle dependencies, see https://github.com/apache/spark/pull/21339 and https://github.com/apache/spark/pull/17416

A workaround for the classifier problem looks like this:

$ cp .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2-hadoop2.jar .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar

but DevOps won't accept this.

The complete list of dependencies looks like this (I have added line breaks for better readability):

root@a5a04d888f85:/opt/spark-2.4.5/conf# cat spark-defaults.conf
spark.jars.packages=com.fasterxml.jackson.datatype:jackson-datatype-jdk8:2.9.10,
com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.9.10,
org.apache.spark:spark-hive_2.11:2.4.5,
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,
org.apache.hadoop:hadoop-aws:2.10.0,
io.delta:delta-core_2.11:0.5.0,
org.postgresql:postgresql:42.2.5,
mysql:mysql-connector-java:8.0.18,
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,
io.prestosql:presto-jdbc:307

(everything works - except for Hive)
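Since the listing above was wrapped only for readability, the value in the real spark-defaults.conf has to end up as one comma-separated line. A minimal sketch of how that line could be generated from a one-coordinate-per-line list follows; the function name join_packages is hypothetical and not part of Spark.

```shell
# Join Maven coordinates (one per line on stdin) into a single
# spark.jars.packages line. Whitespace, blank lines and trailing
# commas are dropped. join_packages is a hypothetical helper name.
join_packages() (
  joined=""
  while IFS= read -r line || [ -n "$line" ]; do
    line=$(printf '%s' "$line" | tr -d '[:space:]')   # strip all whitespace
    line=${line%,}                                    # strip one trailing comma
    [ -n "$line" ] || continue                        # skip empty lines
    joined="${joined:+$joined,}$line"
  done
  printf 'spark.jars.packages=%s\n' "$joined"
)

# Example with two of the coordinates from the question:
join_packages <<'EOF'
org.apache.spark:spark-hive_2.11:2.4.5,
org.apache.hadoop:hadoop-aws:2.10.0
EOF
# prints: spark.jars.packages=org.apache.spark:spark-hive_2.11:2.4.5,org.apache.hadoop:hadoop-aws:2.10.0
```

Keeping the readable list in a separate file and generating the property from it avoids editing one very long line by hand.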

Is the combination of Spark 2.4.5 and Hadoop 2.10 used anywhere? How?
How can Spark 2.4.5 with user-provided Hadoop be combined with Hadoop 2.9 or 2.10?
Is it necessary to build Spark to get around the Hive dependency problem?

Beryllium · Accepted Answer

There does not seem to be an easy way to configure Spark 2.4.5 with user-provided Hadoop to use Hadoop 2.10.0.

As my task actually was to minimize dependency problems, I have chosen to compile Spark 2.4.5 against Hadoop 2.10.0.

./dev/make-distribution.sh \
  --name hadoop-2.10.0 \
  --tgz \
  -Phadoop-2.7 -Dhadoop.version=2.10.0 \
  -Phive -Phive-thriftserver \
  -Pyarn

Now Maven deals with the Hive dependencies/classifiers, and the resulting package is ready to be used.
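One way to sanity-check the resulting package is to look at which Hadoop versions actually ended up in its jars/ directory. The sketch below assumes the distribution has been unpacked somewhere and takes that jars/ directory as an argument; hadoop_jar_versions is a hypothetical helper name, and it assumes plain name-version.jar file names without classifiers.

```shell
# Print the distinct versions of the hadoop-*.jar files in directory $1.
# Assumes file names of the form hadoop-<module>-<version>.jar.
hadoop_jar_versions() (
  dir=$1
  for jar in "$dir"/hadoop-*.jar; do
    [ -e "$jar" ] || continue        # the glob matched nothing
    name=${jar##*/}                  # strip the directory part
    name=${name%.jar}                # strip the extension
    printf '%s\n' "${name##*-}"      # text after the last dash is the version
  done | sort -u
)

# Usage (the path is a placeholder for wherever the package was unpacked):
#   hadoop_jar_versions /path/to/unpacked-spark/jars
```

If the build picked up the intended Hadoop release, only a single version should be printed; more than one line suggests mixed Hadoop JARs on the classpath.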

In my opinion, compiling Spark is actually easier than configuring the Spark build with user-provided Hadoop.

Integration tests so far have not shown any problems: Spark can access both HDFS and S3 (MinIO).
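For readers who want to repeat the S3 test against an S3-compatible store, here is a sketch of the spark-defaults.conf lines that typically point the S3A connector at a custom endpoint. This is not the configuration used in the answer, which is not shown; the function name s3a_endpoint_conf and the example host are hypothetical. The property names are standard Hadoop S3A settings passed through Spark's spark.hadoop. prefix, and credentials (fs.s3a.access.key / fs.s3a.secret.key) still have to be supplied separately.

```shell
# Emit spark-defaults.conf lines for an S3-compatible endpoint given as $1.
# s3a_endpoint_conf is a hypothetical helper; the endpoint is an assumption.
s3a_endpoint_conf() (
  endpoint=$1
  printf 'spark.hadoop.fs.s3a.endpoint=%s\n' "$endpoint"
  # Path-style access addresses buckets as <endpoint>/<bucket> instead of
  # <bucket>.<endpoint>, which S3-compatible stores often require.
  printf 'spark.hadoop.fs.s3a.path.style.access=true\n'
  # Plain-HTTP endpoints need TLS switched off in the connector.
  case $endpoint in
    http://*) printf 'spark.hadoop.fs.s3a.connection.ssl.enabled=false\n' ;;
  esac
)

# Usage (hypothetical host and port):
#   s3a_endpoint_conf http://minio.example:9000 >> conf/spark-defaults.conf
```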

Update 2021-04-08

If you want to add support for Kubernetes, just add -Pkubernetes to the list of arguments.

Samson Scharfrichter · Answer

Assuming you don't want to run Spark on YARN, start from the "Spark 2.4.5 with Hadoop 2.7" bundle, then cherry-pick the Hadoop libraries to upgrade from the "Hadoop 2.10.x" bundle:

1. Discard spark-yarn / hadoop-yarn-* / hadoop-mapreduce-client-* JARs because you won't need them, except hadoop-mapreduce-client-core, which is referenced by write operations on HDFS and S3 (cf. "MR commit procedure" V1 or V2)
   - you may also discard spark-mesos / mesos-* and/or spark-kubernetes / kubernetes-* JARs depending on what you plan to run Spark on
   - you may also discard spark-hive-thriftserver and hive-* JARs if you don't plan to run a "thrift server" instance, except hive-metastore, which is necessary for, as you might guess, managing the Metastore (either a regular Hive Metastore service or an embedded Metastore inside the Spark session)
2. Discard hadoop-hdfs / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl JARs
3. Replace them with hadoop-hdfs-client / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl / stax2-api JARs from Hadoop 2.10 (under common/ and common/lib/, or hdfs/ and hdfs/lib/)
4. Add the S3A connector from Hadoop 2.10, i.e. hadoop-aws / jets3t / woodstox-core JARs (under tools/lib/)
5. Download aws-java-sdk from Amazon (cannot be bundled with Hadoop because it's not an Apache license, I guess)
6. Finally, run a lot of tests...
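Before deleting anything, it may help to list which JARs in the Spark bundle match the "discard, then replace" names from the steps above. The following is a dry-run sketch only: it prints file names and removes nothing. The function name list_hadoop_jars_to_replace is hypothetical, and the jars/ directory is passed in as an argument.

```shell
# Dry run: list the JARs in directory $1 whose names start with one of the
# prefixes named in the "discard, then replace" step. Nothing is deleted.
list_hadoop_jars_to_replace() (
  dir=$1
  for prefix in hadoop-hdfs hadoop-common hadoop-auth hadoop-annotations \
                htrace-core xercesImpl; do
    for jar in "$dir/$prefix"*.jar; do
      [ -e "$jar" ] || continue      # the glob matched nothing
      printf '%s\n' "${jar##*/}"     # print the file name only
    done
  done | sort
)

# Usage (the path is a placeholder for the Spark bundle's jars/ directory):
#   list_hadoop_jars_to_replace /path/to/spark/jars
```

Reviewing this list against the replacement JARs taken from the Hadoop 2.10 distribution makes it easier to spot a file that was removed without a counterpart being added.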

That worked for me, after some trial-and-error -- with a caveat: I ran my tests against an S3-compatible storage system, but not against the "real" S3, and not against regular HDFS. And without a "real" Hive Metastore service, just the embedded in-memory & volatile Metastore that Spark runs by default.

For the record, the process is the same with Spark 3.0.0 previews and Hadoop 3.2.1, except that:

- you also have to upgrade guava
- you don't have to upgrade xercesImpl nor htrace-core nor stax2-api
- you don't need jets3t any more
- you need to retain more hadoop-mapreduce-client-* JARs (probably because of the new "S3 committers")
