How can I access S3/S3n from a local Hadoop 2.6 installation?

Tags:

amazon-web-services

amazon-s3

hadoop

hadoop2

hadoop-yarn

I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop at the time of writing, 2.6.0. Now I would like to access an S3 bucket, as I do inside the EMR cluster.

I have added the AWS credentials to core-site.xml:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>some id</value>
</property>

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>some id</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>some key</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>some key</value>
</property>

Note: Since the secret key contains slashes, I have escaped them as %2F.

If I try to list the contents of the bucket:

hadoop fs -ls s3://some-url/bucket/

I get this error:

ls: No FileSystem for scheme: s3

I edited core-site.xml again and added the filesystem implementation classes:

<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>

<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>

This time I get a different error:

-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)

I suspect the YARN distribution does not include the jars needed to read S3, but I have no idea where to get them. Any pointers in this direction would be greatly appreciated.

asked Jan 19 '15 16:01

doublebyte

5 Answers

For some reason, the jar hadoop-aws-[version].jar, which contains the implementation of NativeS3FileSystem, is not on Hadoop's classpath by default in versions 2.6 and 2.7. Try adding it to the classpath by putting the following line in hadoop-env.sh, which is located at $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*

This assumes you are using Apache Hadoop 2.6 or 2.7.

By the way, you can check Hadoop's classpath with:

bin/hadoop classpath
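To confirm that the change took effect, the classpath output can be searched for the jar. Below is a minimal Python sketch, not part of Hadoop: the helper name `jars_on_classpath` is made up for illustration, and it assumes wildcard entries (`dir/*`) mean "every .jar directly inside `dir`", as the JVM treats them.

```python
import glob
import os


def jars_on_classpath(classpath, prefix="hadoop-aws"):
    """Return sorted paths of existing jars on `classpath` whose name starts with `prefix`.

    `classpath` is an os.pathsep-separated string, such as the text printed
    by `hadoop classpath`. Entries of the form `dir/*` are expanded to the
    .jar files directly inside `dir`.
    """
    matches = set()
    for entry in classpath.split(os.pathsep):
        entry = entry.strip()
        if not entry:
            continue
        if os.path.basename(entry) == "*":
            # Wildcard entry: look at every jar in the directory.
            candidates = glob.glob(os.path.join(os.path.dirname(entry), "*.jar"))
        else:
            candidates = [entry]
        for path in candidates:
            name = os.path.basename(path)
            if name.startswith(prefix) and name.endswith(".jar") and os.path.isfile(path):
                matches.add(path)
    return sorted(matches)
```

Pass it the text printed by `bin/hadoop classpath`, for example captured with `subprocess.check_output(["hadoop", "classpath"], text=True)`. An empty list suggests the `tools/lib` directory is still missing from the classpath.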

answered Oct 17 '22 21:10

Ashrith

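import os

# Fetch the AWS SDK and hadoop-aws jars at startup.
# This must be set before the SparkContext is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

import pyspark
sc = pyspark.SparkContext("local[*]")

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Set the S3 filesystem class and credentials on Spark's Hadoop configuration.
hadoopConf = sc._jsc.hadoopConfiguration()
myAccessKey = input()  # AWS access key ID, read from stdin
mySecretKey = input()  # AWS secret access key, read from stdin
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)

# Read a Parquet dataset from S3 (bucket and key are placeholders).
df = sqlContext.read.parquet("s3://myBucket/myKey")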

answered Oct 17 '22 22:10

Kamil Sindi

@Ashrith's answer worked for me with one modification: I had to use $HADOOP_PREFIX rather than $HADOOP_HOME when running v2.6 on Ubuntu. Perhaps this is because $HADOOP_HOME is being deprecated?

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${HADOOP_PREFIX}/share/hadoop/tools/lib/*

Having said that, neither worked on my Mac with v2.6 installed via Homebrew. In that case, I'm using this extremely kludgy export:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(brew --prefix hadoop)/libexec/share/hadoop/tools/lib/*

answered Oct 17 '22 22:10

Matt K

To resolve this issue I tried all of the above, none of which worked (in my environment, anyway).

However, I was able to get it working by copying the two jars mentioned above from the tools dir into common/lib.

It worked fine after that.
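For reference, the copy can be scripted. This is only a sketch under assumptions: the answer does not name the two jars, so the `hadoop-aws` and `aws-java-sdk` prefixes are a guess based on the other answers, the directories are assumed to be `share/hadoop/tools/lib` and `share/hadoop/common/lib` under the Hadoop install, and `copy_s3_jars` is a hypothetical helper name.

```python
import glob
import os
import shutil


def copy_s3_jars(hadoop_home, prefixes=("hadoop-aws", "aws-java-sdk")):
    """Copy jars matching `prefixes` from tools/lib into common/lib.

    Returns the file names that were copied. Both directories are assumed
    to exist already under `hadoop_home`/share/hadoop.
    """
    src_dir = os.path.join(hadoop_home, "share", "hadoop", "tools", "lib")
    dst_dir = os.path.join(hadoop_home, "share", "hadoop", "common", "lib")
    copied = []
    for prefix in prefixes:
        # Match e.g. hadoop-aws-2.6.0.jar without hard-coding the version.
        for jar in sorted(glob.glob(os.path.join(src_dir, prefix + "*.jar"))):
            shutil.copy2(jar, dst_dir)
            copied.append(os.path.basename(jar))
    return copied
```

Calling `copy_s3_jars(os.environ["HADOOP_HOME"])` would perform the copy for the current installation. Copying rather than moving leaves `tools/lib` untouched, so the change is easy to undo.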

answered Oct 17 '22 22:10

null

If you are using HDP 2.x or greater, you can try modifying the following property in the MapReduce2 configuration settings in Ambari:

mapreduce.application.classpath

Append the following value to the end of the existing string:

/usr/hdp/${hdp.version}/hadoop-mapreduce/*
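If the property value is edited by script rather than by hand, the append should be idempotent so the entry is not added twice. A minimal Python sketch follows; `append_classpath_entry` is a hypothetical helper, and the separator is passed in explicitly because it should match whatever the existing string already uses.

```python
def append_classpath_entry(current, entry, sep):
    """Append `entry` to the `sep`-separated classpath string `current`.

    The entry is added only if it is not already present, and empty
    segments in the existing value are dropped.
    """
    parts = [p.strip() for p in current.split(sep) if p.strip()]
    if entry not in parts:
        parts.append(entry)
    return sep.join(parts)
```

For example, `append_classpath_entry(existing_value, "/usr/hdp/${hdp.version}/hadoop-mapreduce/*", ",")` returns the new string to paste back into Ambari, where `existing_value` stands for the current property value.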

answered Oct 17 '22 20:10

David Kjerrumgaard
