I'm trying to access an S3 file from a SparkSQL job. I've already tried solutions from several posts, but nothing seems to work, maybe because my EC2 cluster runs the new Spark 2.0 for Hadoop 2.7.
I set up the Hadoop configuration this way:
sc.hadoopConfiguration.set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", secretKey)
I build an uber-jar with sbt assembly, using:
name := "test"
version := "0.2.0"
scalaVersion := "2.11.8"
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll(
ExclusionRule("com.amazonaws", "aws-java-sdk"),
ExclusionRule("commons-beanutils")
)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
When I submit my job to the cluster, I always get the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 172.31.7.246): java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1726)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:662)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:446)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:476)
It seems the driver is able to read from S3 without a problem, but not the workers/executors. I do not understand why my uber-jar is not sufficient.
I also tried, without success, to configure spark-submit using:
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
PS: If I switch to the s3n protocol, I get the following exception:
java.io.IOException: No FileSystem for scheme: s3n
If you are using PySpark to access S3 buckets, you must pass the Spark engine the right packages to use, specifically aws-java-sdk and hadoop-aws. It is important to pick the right package versions; as of this writing, aws-java-sdk 1.7.4 and the hadoop-aws 2.7.x line work together.
You can access data stored in an Amazon S3 bucket from your Spark job by using the Hadoop S3A client. For the full list of Hadoop S3A client configuration options, see Hadoop-AWS module: Integration with Amazon Web Services.
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR.
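For reference, a minimal S3A setup in Scala could look like the sketch below. The bucket and path are placeholders, and fs.s3a.access.key / fs.s3a.secret.key are the credential property names documented in the Hadoop-AWS module:
// Register the S3A filesystem and credentials, then read a file (bucket/path are placeholders)
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
val df = spark.read.text("s3a://my-bucket/path/to/file.txt")
df.show(5)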
If you want to use s3n:
sc.hadoopConfiguration.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)
Now, regarding the exception: you need to make sure both JARs are on the driver and worker classpaths. If you're using client mode, make sure they are distributed to the worker nodes via the --jars flag:
spark-submit \
--conf "spark.driver.extraClassPath=/location/to/aws-java-sdk.jar:/location/to/hadoop-aws.jar" \
--jars /location/to/aws-java-sdk.jar,/location/to/hadoop-aws.jar \
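If the JARs sit at the same path on every node, you can also put them on the executor classpath explicitly; one possible variant of the command above (paths are placeholders):
spark-submit \
--conf "spark.driver.extraClassPath=/location/to/aws-java-sdk.jar:/location/to/hadoop-aws.jar" \
--conf "spark.executor.extraClassPath=/location/to/aws-java-sdk.jar:/location/to/hadoop-aws.jar" \
--jars /location/to/aws-java-sdk.jar,/location/to/hadoop-aws.jar \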
Also, if you're building your uber-jar and including aws-java-sdk and hadoop-aws, there's no reason to use the --packages flag.
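One thing to watch when assembling those two libraries into an uber-jar (a common gotcha, not something from the original post): duplicate META-INF entries during sbt assembly. A typical merge strategy looks roughly like:
// build.sbt -- example merge strategy for duplicate META-INF entries when bundling aws-java-sdk/hadoop-aws
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}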
Actually, all Spark operations run on the workers, and you set this configuration only on the master (driver), so you can try applying the S3 configuration inside mapPartitions { } so it happens on the executors.
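A rough sketch of what that suggestion could look like, assuming accessKey/secretKey are plain String vals captured by the closure, and with my-bucket and keys.txt as hypothetical inputs:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// keys.txt (hypothetical) holds one S3 object key per line
val keys = sc.textFile("keys.txt")

val sizes = keys.mapPartitions { iter =>
  // Build the Hadoop configuration on the executor itself, so every worker
  // has the S3A implementation class name and the credentials.
  val conf = new Configuration()
  conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  conf.set("fs.s3a.access.key", accessKey)
  conf.set("fs.s3a.secret.key", secretKey)
  val fs = FileSystem.get(new URI("s3a://my-bucket"), conf)
  // Example work: look up the size of each object listed in keys.txt
  iter.map(key => fs.getFileStatus(new Path(s"s3a://my-bucket/$key")).getLen)
}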