Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including:
deploying with hadoop-aws and aws-java-sdk => cannot read environment variables for credentials
adding hadoop-aws to Maven => various transitive dependency conflicts
Has anyone successfully made both work?
If you are using PySpark to access S3 buckets, you must pass the Spark engine the right packages to use, specifically aws-java-sdk and hadoop-aws. It'll be important to identify the right package versions to use. As of this writing, aws-java-sdk 1.7.4 and hadoop-aws 2.7.x are the versions that work together.
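For example, one way to pull both in (a minimal sketch, assuming the JARs are fetched from Maven Central; match the hadoop-aws version to the Hadoop build your cluster uses) is the --packages flag when launching PySpark:

pyspark --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1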
Having experienced first hand the difference between s3a and s3n - 7.9GB of data transferred over s3a took around ~7 minutes, while the same 7.9GB over s3n took 73 minutes [us-east-1 to us-west-1 in both cases, unfortunately; Redshift and Lambda being us-east-1 at this time] - this is a very important piece of the stack to get correct and it's worth the frustration.
Here are the key parts, as of December 2015:
Your Spark cluster will need Hadoop version 2.x or greater. If you use the Spark EC2 setup scripts and maybe missed it, the switch for using something other than Hadoop 1.0 is to specify --hadoop-major-version 2 (which uses CDH 4.2 as of this writing).
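For example, a launch command might look like this (a sketch; the key pair, identity file, and cluster name are placeholders):

./spark-ec2 --key-pair=my-keypair --identity-file=my-keypair.pem --hadoop-major-version=2 launch my-spark-cluster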
You'll need to include what may at first seem to be an out of date AWS SDK library (built in 2014 as version 1.7.4) for versions of Hadoop as late as 2.7.1 (stable): aws-java-sdk 1.7.4. As far as I can tell using this along with the specific AWS SDK JARs for 1.10.8 hasn't broken anything.
You'll also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.
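One way to get both JARs onto the driver and executor classpaths (a sketch; the paths are placeholders) is to hand them to spark-submit with --jars:

spark-submit --jars /path/to/aws-java-sdk-1.7.4.jar,/path/to/hadoop-aws-2.7.1.jar other options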
In spark.properties you probably want some settings that look like this:
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
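The same options can also be set programmatically from PySpark if you prefer not to touch the properties file (a sketch; sc._jsc reaches into the underlying JVM context and is not a public API, and the bucket name below is hypothetical):

# apply the S3A credentials on the running SparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "ACCESSKEY")
hadoop_conf.set("fs.s3a.secret.key", "SECRETKEY")

# quick sanity check against a hypothetical bucket
print(sc.textFile("s3a://my-bucket/some/file.txt").count())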
If you are using Hadoop 2.7.x with Spark, the AWS client uses V2 as the default auth signature, and all the new AWS regions support only the V4 protocol. To use V4, pass these confs in spark-submit, and the endpoint (format: s3.<region>.amazonaws.com) must also be specified.
--conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
--conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
I've covered this list in more detail in a post I wrote as I worked my way through this process. In addition, I've covered all the exception cases I hit along the way, what I believe to be the cause of each, and how to fix them.
I'm writing this answer to access files with S3A from Spark 2.0.1 on Hadoop 2.7.3
Copy the AWS JARs (hadoop-aws-2.7.3.jar and aws-java-sdk-1.7.4.jar), which ship with Hadoop by default, into the Spark classpath, which holds all the Spark JARs.

Hint: If you are unsure of the JAR locations, running the find command as a privileged user can be helpful:

find / -name "hadoop-aws*.jar"
find / -name "aws-java-sdk*.jar"

Hint: We cannot point to the location directly (it must be set in a property file), as I want to make this answer generic for distributions and Linux flavors. The Spark classpath can be identified by the find command below:

find / -name "spark-core*.jar"
Add the properties below to spark-defaults.conf (Hint: mostly it will be placed in /etc/spark/conf/spark-defaults.conf):
#make sure jars are added to CLASSPATH
spark.yarn.jars=file://{spark/home/dir}/jars/*.jar,file://{hadoop/install/dir}/share/hadoop/tools/lib/*.jar
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key={s3a.access.key}
spark.hadoop.fs.s3a.secret.key={s3a.secret.key}
#you can set the above 3 properties at the Hadoop level in `core-site.xml` as well by removing the spark prefix.
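As the last comment above notes, the same three S3A properties can instead go into Hadoop's core-site.xml without the spark. prefix; a sketch (the values in braces are placeholders for your keys):

<configuration>
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>{s3a.access.key}</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>{s3a.secret.key}</value>
  </property>
</configuration>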
In spark-submit, include the JARs (aws-java-sdk and hadoop-aws) in --driver-class-path if needed:

spark-submit --master yarn \
  --driver-class-path {spark/jars/home/dir}/aws-java-sdk-1.7.4.jar:{spark/jars/home/dir}/hadoop-aws-2.7.3.jar \
  other options

Note that if --driver-class-path is given more than once, only the last value is kept, so both JARs are joined with a colon here.
Note: Make sure the Linux user has reading privileges before running the find command, to prevent Permission denied errors.
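Once the JARs and keys are in place, a quick way to confirm the setup from the PySpark shell (a sketch; the bucket and path are hypothetical):

# Spark 2.x: the shell provides a SparkSession named `spark`
df = spark.read.text("s3a://my-bucket/some/path/")
df.show(5)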