Hadoop 2.9.2, Spark 2.4.0 access AWS s3a bucket

Question

I have spent a couple of days on this and still cannot read from a public Amazon S3 bucket using Spark :(

Here is the spark-shell command:

spark-shell  --master yarn \
              -v \
              --jars file:/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar,file:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar \
              --driver-class-path=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar

The application started and the shell reached the prompt:

   ____              __
  / __/__  ___ _____/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 2.4.0
   /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val data1 = sc.textFile("s3a://my-bucket-name/README.md")

18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 242.1 KB, free 246.7 MB)
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.2 KB, free 246.6 MB)
18/12/25 13:06:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop-edge01:3545 (size: 24.2 KB, free: 246.9 MB)
18/12/25 13:06:40 INFO SparkContext: Created broadcast 0 from textFile at <console>:24
data1: org.apache.spark.rdd.RDD[String] = s3a://my-bucket-name/README.md MapPartitionsRDD[1] at textFile at <console>:24

scala> data1.count()

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
... 49 elided
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 77 more

scala>

All AWS access keys and secret keys were set in hadoop/core-site.xml as described here: Hadoop-AWS module: Integration with Amazon Web Services
The bucket is public: anyone can download from it (tested with curl -O)
All the .jars, as you can see, were provided by Hadoop itself from the /usr/local/hadoop/share/hadoop/tools/lib/ folder
There are no additional settings in spark-defaults.conf, only what was passed on the command line

Neither jar provides this class:

jar tf /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)

jar tf /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)
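The same check can be run over a whole directory of jars instead of one jar at a time. This is only a rough sketch: the helper names list_jar and find_class_in_jars are made up for illustration, and it assumes the JDK's jar tool is on the PATH.

```shell
#!/bin/sh
# List the entries of one jar. Assumes the JDK's `jar` tool is on PATH
# (hypothetical helper; any tool that prints one entry per line would do).
list_jar() {
    jar tf "$1"
}

# Print every jar in directory $2 that contains class $1.
# The class is given in dotted form, e.g. org.apache.hadoop.fs.StorageStatistics.
find_class_in_jars() {
    # Convert the dotted class name into the entry path used inside a jar.
    class_path="$(printf '%s' "$1" | tr '.' '/').class"
    for jar_file in "$2"/*.jar; do
        [ -f "$jar_file" ] || continue
        # -x matches the whole line, -F treats the path as a fixed string.
        if list_jar "$jar_file" | grep -qxF "$class_path"; then
            printf '%s\n' "$jar_file"
        fi
    done
}
```

For example, find_class_in_jars org.apache.hadoop.fs.StorageStatistics /usr/local/spark/jars would show whether any jar that Spark itself ships provides the class. As far as I know the class belongs to hadoop-common and only exists from Hadoop 2.8 onward, so an empty result there would point at an older bundled hadoop-common.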

What should I do? Did I forget to add another jar? What is the exact configuration of hadoop-aws and aws-java-sdk-bundle, and which versions do I need?

Alex F · Accepted Answer

I finally found the problem.

The main issue is that the Spark I have is a pre-built distribution: 'v2.4.0 pre-built for Hadoop 2.7 and later'. That title is a bit misleading, as my struggles above show. Spark actually ships with its own Hadoop jars, in a different version from my Hadoop 2.9.2 installation. A listing of /usr/local/spark/jars/ shows that it has:

hadoop-common-2.7.3.jar
hadoop-client-2.7.3.jar
....

It is only missing hadoop-aws and aws-java-sdk. After a little digging in the Maven repository I found hadoop-aws 2.7.3 and its dependency aws-java-sdk 1.7.4, and voila! I downloaded those jars and passed them as parameters to Spark, like this:

spark-shell \
--master yarn \
-v \
--jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar \
--driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar

That did the job!

I am still wondering why the jars from Hadoop (I passed all of them via --jars and --driver-class-path) were not picked up. Spark somehow chooses its own bundled jars automatically rather than the ones I pass.
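The version matching I did by hand can be sketched as a small script: read the hadoop-common version from Spark's jars directory and derive the hadoop-aws jar name to look for. The function names bundled_hadoop_version and matching_hadoop_aws are hypothetical, and the sketch assumes the directory holds a single hadoop-common-<version>.jar.

```shell
#!/bin/sh
# Print the version of the hadoop-common jar found in directory $1
# (e.g. "2.7.3" for hadoop-common-2.7.3.jar). Returns 1 if none is found.
# Assumes there is only one hadoop-common jar in the directory.
bundled_hadoop_version() {
    for jar_file in "$1"/hadoop-common-*.jar; do
        [ -f "$jar_file" ] || continue
        name=${jar_file##*/}          # strip the directory part
        name=${name#hadoop-common-}   # strip the artifact prefix
        printf '%s\n' "${name%.jar}"  # strip the extension
        return 0
    done
    return 1
}

# Print the hadoop-aws jar name that matches the bundled Hadoop version.
matching_hadoop_aws() {
    version=$(bundled_hadoop_version "$1") || return 1
    printf 'hadoop-aws-%s.jar\n' "$version"
}
```

On my machine matching_hadoop_aws /usr/local/spark/jars would be expected to print hadoop-aws-2.7.3.jar. The AWS SDK version cannot be derived this way; it still has to be looked up among the dependencies of that hadoop-aws release (1.7.4 in my case).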

zszohar stiro · Answer

I advise you not to do what you did. You are running a pre-built Spark with Hadoop 2.7.3 jars on Hadoop 2.9.2, and to solve the issue you added more jars from Hadoop 2.7.3 to the classpath to work with S3.

What you should do instead is use a "Hadoop free" Spark build and supply the Hadoop jars through configuration, as described at the following link: https://spark.apache.org/docs/2.4.0/hadoop-provided.html

The main parts:

# in conf/spark-env.sh

# If the 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# With an explicit path to the 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
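One caveat, as far as I can tell: on a stock Hadoop 2.x install, the output of hadoop classpath does not include share/hadoop/tools/lib, which is where the question found hadoop-aws and the AWS SDK bundle. If that holds for your install, the S3A jars need to be appended. A minimal sketch, where append_s3a_jars is a made-up helper name and the jar name patterns are assumptions based on the file names in the question:

```shell
#!/bin/sh
# Append every hadoop-aws and aws-java-sdk jar found in directory $2
# to the classpath string $1 and print the result.
append_s3a_jars() {
    classpath=$1
    for jar_file in "$2"/hadoop-aws-*.jar "$2"/aws-java-sdk-*.jar; do
        [ -f "$jar_file" ] || continue
        # Only add the ':' separator when the classpath is not empty.
        classpath="${classpath:+$classpath:}$jar_file"
    done
    printf '%s\n' "$classpath"
}

# Possible use in conf/spark-env.sh (tools/lib path taken from the question):
# export SPARK_DIST_CLASSPATH=$(append_s3a_jars "$(hadoop classpath)" /usr/local/hadoop/share/hadoop/tools/lib)
```

Because the jars come from the same Hadoop install as the rest of the classpath, hadoop-aws and hadoop-common stay on the same version, which is the whole point of the "Hadoop free" setup.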

Dakshin Rajavel · Answer

I use Spark 2.4.5; this is what I did and it worked for me. I am able to connect to AWS S3 from Spark on my local machine.

(1) Download Spark 2.4.5 from here: https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12.tgz. This Spark build does not include Hadoop.
(2) Download Hadoop: https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
(3) Update .bash_profile:
export SPARK_HOME=<SPARK_PATH> # example: /home/spark-2.4.5/spark-2.4.5-bin-without-hadoop-scala-2.12
export PATH=$SPARK_HOME/bin:$PATH
(4) Add Hadoop to the Spark environment:
Copy conf/spark-env.sh.template to conf/spark-env.sh
Add export SPARK_DIST_CLASSPATH=$(<hadoop_path> classpath)
where <hadoop_path> is the path to your hadoop binary, e.g. /home/hadoop-3.2.1/bin/hadoop
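Step (4) can be scripted roughly as follows. This is a sketch, not something I ran as-is: configure_spark_env is a hypothetical helper, and it assumes the Spark directory contains conf/spark-env.sh.template as in the standard download.

```shell
#!/bin/sh
# Create $1/conf/spark-env.sh from the template (if it does not exist yet)
# and append the SPARK_DIST_CLASSPATH export using the hadoop binary at $2.
configure_spark_env() {
    spark_home=$1
    hadoop_bin=$2
    env_file="$spark_home/conf/spark-env.sh"
    [ -f "$env_file" ] || cp "$env_file.template" "$env_file" || return 1
    # Do nothing if an export is already present, so reruns do not duplicate it.
    if grep -q '^export SPARK_DIST_CLASSPATH=' "$env_file"; then
        return 0
    fi
    # The single-quoted format keeps $(...) literal, so the hadoop command
    # is evaluated each time Spark sources the file, not when this runs.
    printf 'export SPARK_DIST_CLASSPATH=$(%s classpath)\n' "$hadoop_bin" >> "$env_file"
}
```

With the paths from the steps above, the call would be configure_spark_env "$SPARK_HOME" /home/hadoop-3.2.1/bin/hadoop.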
