I'm trying to access an S3 file from a SparkSQL job. I've already tried solutions from several posts, but nothing seems to work, maybe because my EC2 cluster runs the new Spark 2.0 for Hadoop 2.7.
I set up the Hadoop configuration this way:
sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", secretKey)
I build an uber-jar with sbt assembly, using:
name := "test"
version := "0.2.0"
scalaVersion := "2.11.8"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll(
ExclusionRule("com.amazonaws", "aws-java-sdk"),
ExclusionRule("commons-beanutils")
)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
When I submit my job to the cluster, I always get the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 172.31.7.246): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1726)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:662)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:446)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:476)
It seems the driver is able to read from S3 without a problem, but not the workers/executors. I do not understand why my uber-jar is not sufficient.
I also tried, without success, to configure spark-submit using:
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
PS: If I switch to the s3n protocol, I get the following exception:
java.io.IOException: No FileSystem for scheme: s3n
If you are using PySpark to access S3 buckets, you must pass the Spark engine the right packages to use, specifically aws-java-sdk and hadoop-aws. It is important to pick the right package versions; as of this writing, aws-java-sdk 1.7.4 and the hadoop-aws 2.7.x line work together.
You can access data stored in an Amazon S3 bucket from your Spark job by using the Hadoop S3A client. For the full list of Hadoop S3A client configuration options, see Hadoop-AWS module: Integration with Amazon Web Services.
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR.
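For reference, a minimal S3A setup in Scala could look like the sketch below. The bucket and path are placeholders, and fs.s3a.access.key / fs.s3a.secret.key are the credential property names documented in the Hadoop-AWS module:
// Register the S3A filesystem and credentials, then read a file (bucket/path are placeholders)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
val df = spark.read.text("s3a://my-bucket/path/to/file.txt")
df.show(5)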
If you want to use s3n:
sc.hadoopConfiguration.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)
Now, regarding the exception: you need to make sure both JARs are on the driver and worker classpaths. If you're using client mode, make sure they are distributed to the worker nodes via the --jars flag:
spark-submit \
--conf "spark.driver.extraClassPath=/location/to/aws-java-sdk.jar:/location/to/hadoop-aws.jar" \
--jars /location/to/aws-java-sdk.jar,/location/to/hadoop-aws.jar \
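If the JARs sit at the same path on every node, you can also put them on the executor classpath explicitly; one possible variant of the command above (paths are placeholders):
spark-submit \
--conf "spark.driver.extraClassPath=/location/to/aws-java-sdk.jar:/location/to/hadoop-aws.jar" \
--conf "spark.executor.extraClassPath=/location/to/aws-java-sdk.jar:/location/to/hadoop-aws.jar" \
--jars /location/to/aws-java-sdk.jar,/location/to/hadoop-aws.jar \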
Also, if you're building your uber-jar and including aws-java-sdk and hadoop-aws, there's no reason to use the --packages flag.
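One thing to watch when assembling those two libraries into an uber-jar (a common gotcha, not something from the original post): duplicate META-INF entries during sbt assembly. A typical merge strategy looks roughly like:
// build.sbt -- example merge strategy for duplicate META-INF entries when bundling aws-java-sdk/hadoop-aws
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}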
Actually, all Spark operations run on the workers, and you set this configuration only on the master (driver), so you can try applying the S3 configuration inside mapPartitions { } so it happens on the executors.
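A rough sketch of what that suggestion could look like, assuming accessKey/secretKey are plain String vals captured by the closure, and with my-bucket and keys.txt as hypothetical inputs:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// keys.txt (hypothetical) holds one S3 object key per line
val keys = sc.textFile("keys.txt")

val sizes = keys.mapPartitions { iter =>
  // Build the Hadoop configuration on the executor itself, so every worker
  // has the S3A implementation class name and the credentials.
  val conf = new Configuration()
  conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  conf.set("fs.s3a.access.key", accessKey)
  conf.set("fs.s3a.secret.key", secretKey)
  val fs = FileSystem.get(new URI("s3a://my-bucket"), conf)
  // Example work: look up the size of each object listed in keys.txt
  iter.map(key => fs.getFileStatus(new Path(s"s3a://my-bucket/$key")).getLen)
}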