I am trying to reproduce an Amazon EMR cluster on my local machine. For that purpose, I have installed the latest stable version of Hadoop as of now - 2.6.0. Now I would like to access an S3 bucket, as I do inside the EMR cluster.
I have added the aws credentials in core-site.xml:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>some id</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>some id</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>some key</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>some key</value>
</property>
Note: Since there are some slashes on the key, I have escaped them with %2F
If I try to list the contents of the bucket:
hadoop fs -ls s3://some-url/bucket/
I get this error:
ls: No FileSystem for scheme: s3
I edited core-site.xml again, and added information related to the fs:
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<property>
<name>fs.s3n.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
This time I get a different error:
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
Somehow I suspect the Yarn distribution does not have the necessary jars to be able to read S3, but I have no idea where to get those. Any pointers in this direction would be greatly appreciated.
While Apache Hadoop has traditionally worked with HDFS, S3 also meets Hadoop's file system requirements. Companies such as Netflix have used this compatibility to build Hadoop data warehouses that store information in S3, rather than HDFS.
fs. s3a. AnonymousAWSCredentialsProvider allows anonymous access to a publicly accessible S3 bucket without any credentials. It can be useful for accessing public data sets without requiring AWS credentials.
For some reason, the jar hadoop-aws-[version].jar
which contains the implementation to NativeS3FileSystem
is not present in the classpath
of hadoop by default in the version 2.6 & 2.7. So, try and add it to the classpath by adding the following line in hadoop-env.sh
which is located in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
Assuming you are using Apache Hadoop 2.6 or 2.7
By the way, you could check the classpath of Hadoop using:
bin/hadoop classpath
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'
import pyspark
sc = pyspark.SparkContext("local[*]")
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
hadoopConf = sc._jsc.hadoopConfiguration()
myAccessKey = input()
mySecretKey = input()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
df = sqlContext.read.parquet("s3://myBucket/myKey")
@Ashrith's answer worked for me with one modification: I had to use $HADOOP_PREFIX
rather than $HADOOP_HOME
when running v2.6 on Ubuntu. Perhaps this is because it sounds like $HADOOP_HOME
is being deprecated?
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${HADOOP_PREFIX}/share/hadoop/tools/lib/*
Having said that, neither worked for me on my Mac with v2.6 installed via Homebrew. In that case, I'm using this extremely cludgy export:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(brew --prefix hadoop)/libexec/share/hadoop/tools/lib/*
To resolve this issue I tried all the above, which failed (for my environment anyway).
However I was able to get it working by copying the two jars mentioned above from the tools dir and into common/lib.
Worked fine after that.
If you are using HDP 2.x or greater you can try modifying the following property in the MapReduce2 configuration settings in Ambari.
mapreduce.application.classpath
Append the following value to the end of the existing string:
/usr/hdp/${hdp.version}/hadoop-mapreduce/*
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With