brew-installed apache-spark unable to access S3 files

After brew install apache-spark, sc.textFile("s3n://...") in spark-shell fails with java.io.IOException: No FileSystem for scheme: s3n. The same call works in a spark-shell opened on an EC2 machine launched with spark-ec2. The Homebrew formula appears to build against a sufficiently recent version of Hadoop, and the error is thrown whether or not brew install hadoop has been run first.

How can I install spark with homebrew such that it will be able to read s3n:// files?
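
A minimal repro (the bucket path is a placeholder, not my real one):

    $ spark-shell
    scala> sc.textFile("s3n://some-bucket/some-file").count()
    java.io.IOException: No FileSystem for scheme: s3n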

asked Oct 19 '22 by Walrus the Cat

1 Answer

S3 filesystems aren't enabled by default in Hadoop 2.6, so Spark builds that ship with Hadoop 2.6 have no S3-based filesystem available either. Possible solutions:

  • Solution 1. Use a Spark build made against Hadoop 2.4: in the Homebrew formula, change the tarball name to "spark-1.5.1-bin-hadoop2.4.tgz" and update the sha256, and s3n:// will work (see the first sketch after this list).

  • Solution 2. Enable the s3n:// filesystem explicitly. Pass the --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem option when you start spark-shell.

    You should also set the path to the required libraries: --conf spark.driver.extraClassPath=<path>/* --conf spark.executor.extraClassPath=<path>/*, where <path> is the directory containing the hadoop-aws, aws-java-sdk-1.7.4 and guava-11.0.2 JARs (see the second sketch after this list).

  • Solution 3. Use the newer s3a:// filesystem. It's enabled by default, so no impl option is needed, but the path to the required libraries must be set in the same way (see the third sketch after this list).
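
For Solution 1, a sketch of the formula change (the exact url and sha256 depend on the Spark version your formula pins; treat the values below as placeholders):

    # Open the Homebrew formula in an editor
    $ brew edit apache-spark
    # In the formula, change the tarball name in the url to
    # "spark-1.5.1-bin-hadoop2.4.tgz" and update the sha256 line to the
    # checksum of that file, which you can compute after downloading it:
    $ shasum -a 256 spark-1.5.1-bin-hadoop2.4.tgz
    # Then rebuild from the edited formula
    $ brew reinstall apache-spark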
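
For Solution 2, a sketch of the spark-shell invocation, assuming the JARs were collected under /usr/local/s3jars (a hypothetical directory standing in for <path>):

    $ spark-shell \
        --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
        --conf spark.driver.extraClassPath=/usr/local/s3jars/* \
        --conf spark.executor.extraClassPath=/usr/local/s3jars/*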
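
For Solution 3, the same classpath options without the impl override, since s3a is registered by default (again, /usr/local/s3jars is a placeholder for <path>):

    $ spark-shell \
        --conf spark.driver.extraClassPath=/usr/local/s3jars/* \
        --conf spark.executor.extraClassPath=/usr/local/s3jars/*
    scala> sc.textFile("s3a://some-bucket/some-file").count()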

Note 1: These options can also be set in the conf/spark-defaults.conf file so you don't have to pass them with --conf every time; see the Spark configuration guide.
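
For example, the Solution 2 options would look like this in conf/spark-defaults.conf (paths are the same placeholders as before):

    spark.hadoop.fs.s3n.impl         org.apache.hadoop.fs.s3native.NativeS3FileSystem
    spark.driver.extraClassPath      /usr/local/s3jars/*
    spark.executor.extraClassPath    /usr/local/s3jars/*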

Note 2: You can point <path> at the share/hadoop/tools/lib directory of a Hadoop 2.6+ distribution (s3a requires the libraries from Hadoop 2.7+), or fetch the required JARs from Maven Central.

Note 3: Provide credentials for s3n via environment variables, the ~/.aws/config file, or the --conf spark.hadoop.fs.s3n.awsAccessKeyId= --conf spark.hadoop.fs.s3n.awsSecretAccessKey= options.

s3a requires the --conf spark.hadoop.fs.s3a.access.key= --conf spark.hadoop.fs.s3a.secret.key= options (it reads neither environment variables nor the .aws file).
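
A sketch of both credential styles (the keys are placeholders):

    # s3n: via environment variables, set before launching spark-shell
    $ export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
    $ export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
    $ spark-shell ...

    # s3a: via Spark conf options only
    $ spark-shell \
        --conf spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY \
        --conf spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY ...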

Note 4: The s3:// scheme can be aliased to either s3n (--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem) or s3a (--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem).

answered Oct 22 '22 by Ilya