After brew install apache-spark, sc.textFile("s3n://...") in spark-shell fails with java.io.IOException: No FileSystem for scheme: s3n. This is not the case in a spark-shell accessed through an EC2 machine launched with spark-ec2. The Homebrew formula appears to build against a sufficiently recent version of Hadoop, and the error is thrown whether or not brew install hadoop has been run first.

How can I install Spark with Homebrew such that it can read s3n:// files?
S3 filesystems aren't enabled by default in Hadoop 2.6, so Spark distributions built against Hadoop 2.6 have no S3-based filesystems available either. Possible solutions:
Solution 1. Use a Spark build based on Hadoop 2.4 (in the Homebrew formula, change the file name to "spark-1.5.1-bin-hadoop2.4.tgz" and update the sha256) and the s3n:// filesystem will work, as sketched below.
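A minimal sketch of that formula change, assuming the standard Apache archive layout; the sha256 placeholder must be replaced with the real checksum from the download page:

    # open the formula for editing
    brew edit apache-spark

    # inside the formula, point at the hadoop2.4 build and update the checksum:
    #   url "https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.4.tgz"
    #   sha256 "<real checksum of the hadoop2.4 tarball>"

    # rebuild from the edited formula
    brew reinstall apache-spark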
Solution 2. Enable the s3n:// filesystem. Specify the --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem option when you start spark-shell. You should also set the path to the required libraries: --conf spark.driver.extraClassPath=<path>/* --conf spark.executor.extraClassPath=<path>/*, where <path> is the directory containing the hadoop-aws, aws-java-sdk-1.7.4 and guava-11.0.2 JARs. A full invocation is sketched below.
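For example, a full spark-shell invocation for s3n could look like this (/usr/local/hadoop-libs is a placeholder for wherever you collected the JARs):

    spark-shell \
      --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
      --conf spark.driver.extraClassPath=/usr/local/hadoop-libs/* \
      --conf spark.executor.extraClassPath=/usr/local/hadoop-libs/*

    # then, inside the shell (bucket and key are illustrative):
    #   sc.textFile("s3n://some-bucket/some-file.txt").count()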
Solution 3. Use the newer s3a:// filesystem. It's enabled by default, but the path to the required libraries must still be set, as in the sketch below.
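A sketch for s3a, assuming the JARs were taken from a Hadoop 2.7+ distribution and placed in the same placeholder directory:

    spark-shell \
      --conf spark.driver.extraClassPath=/usr/local/hadoop-libs/* \
      --conf spark.executor.extraClassPath=/usr/local/hadoop-libs/*

    # no fs.s3a.impl override is needed, since s3a is enabled by default:
    #   sc.textFile("s3a://some-bucket/some-file.txt").count()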
Note 1: The options can also be set in the conf/spark-defaults.conf file so you don't need to provide them on every run with --conf; see the Spark configuration guide. A sample file is sketched below.
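For instance, a conf/spark-defaults.conf covering Solution 2 might contain (the classpath directory is again a placeholder):

    spark.hadoop.fs.s3n.impl       org.apache.hadoop.fs.s3native.NativeS3FileSystem
    spark.driver.extraClassPath    /usr/local/hadoop-libs/*
    spark.executor.extraClassPath  /usr/local/hadoop-libs/*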
Note 2: You can point <path> at the share/hadoop/tools/lib directory of a Hadoop 2.6+ distribution (s3a requires libraries from Hadoop 2.7+), or fetch the required libraries (hadoop-aws, aws-java-sdk, guava) from Maven Central, as sketched below.
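A sketch of fetching them from Maven Central with curl; the versions shown are examples only, and the hadoop-aws version should match your Hadoop build:

    mkdir -p /usr/local/hadoop-libs && cd /usr/local/hadoop-libs
    curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.1/hadoop-aws-2.7.1.jar
    curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
    curl -O https://repo1.maven.org/maven2/com/google/guava/guava/11.0.2/guava-11.0.2.jar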
Note 3: Provide credentials for s3n in environment variables, in the ~/.aws/config file, or via --conf spark.hadoop.fs.s3n.awsAccessKeyId= --conf spark.hadoop.fs.s3n.awsSecretAccessKey=. s3a requires the --conf spark.hadoop.fs.s3a.access.key= --conf spark.hadoop.fs.s3a.secret.key= options (it reads neither environment variables nor the .aws file). Examples follow below.
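For example (all key values are placeholders):

    # s3n: export the standard variables before launching spark-shell...
    export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
    export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    # ...or pass them explicitly as Spark conf:
    spark-shell \
      --conf spark.hadoop.fs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID \
      --conf spark.hadoop.fs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY

    # s3a: only the conf options work:
    spark-shell \
      --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
      --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY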
Note 4: s3:// can be set as an alias either for s3n (--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem) or for s3a (--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem), as shown below.
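For example, to route plain s3:// paths through s3a (bucket and key are illustrative):

    spark-shell --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

    # inside the shell:
    #   sc.textFile("s3://some-bucket/some-file.txt").count()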