 

Accessing S3 from Spark 2.0

I'm trying to access an S3 file from a SparkSQL job. I've already tried the solutions from several posts, but nothing seems to work. Maybe it's because my EC2 cluster runs the new Spark 2.0 for Hadoop 2.7.

I set up the Hadoop configuration this way:

sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", secretKey)

I build an uber-jar with sbt assembly using:

name := "test"
version := "0.2.0"
scalaVersion := "2.11.8"

libraryDependencies += "com.amazonaws" % "aws-java-sdk" %   "1.7.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll(
    ExclusionRule("com.amazonaws", "aws-java-sdk"),
    ExclusionRule("commons-beanutils")
)

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"

When I submit my job to the cluster, I always get the following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 172.31.7.246): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1726)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:662)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:446)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:476)

It seems the driver can read from S3 without a problem, but the workers/executors can't... I don't understand why my uber-jar isn't sufficient.

I also tried, without success, to configure spark-submit with:

--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3

PS: If I switch to the s3n protocol, I get the following exception:

java.io.IOException: No FileSystem for scheme: s3n

asked Sep 20 '16 by elldekaa

2 Answers

If you want to use s3n:

sc.hadoopConfiguration.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)

Now, regarding the exception: you need to make sure both JARs are on the driver and worker classpaths, and make sure they're distributed to the worker nodes via the --jars flag if you're running in client mode:

spark-submit \
--conf "spark.driver.extraClassPath=/location/to/aws-java-sdk.jar:/location/to/hadoop-aws.jar" \
--jars /location/to/aws-java-sdk.jar,/location/to/hadoop-aws.jar \

Also, if you're building your uber-JAR and it already includes aws-java-sdk and hadoop-aws, there's no reason to use the --packages flag.

answered by Yuval Itzchakov


Actually, all Spark operations run on the workers, and you set this configuration on the master. You could try applying the S3 configuration inside mapPartitions { } instead, as in the sketch below.
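
A rough sketch of what that could look like. keysRdd and my-bucket are placeholders, accessKey/secretKey are plain String vals captured from the driver closure, and the hadoop-aws / aws-java-sdk classes still need to be on the executor classpath:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// keysRdd: RDD[String] of S3 object keys to read on the executors (hypothetical)
val contents = keysRdd.mapPartitions { keys =>
  // Build the Hadoop configuration on the worker itself, so the credentials
  // are present where the read actually happens.
  val conf = new Configuration()
  conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  conf.set("fs.s3a.access.key", accessKey)
  conf.set("fs.s3a.secret.key", secretKey)
  val fs = FileSystem.get(new URI("s3a://my-bucket"), conf)

  keys.map { key =>
    val in = fs.open(new Path(s"s3a://my-bucket/$key"))
    try scala.io.Source.fromInputStream(in).mkString
    finally in.close()
  }
}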

answered by Sandeep Purohit