
Spark - No FileSystem for scheme: https, cannot load files from Amazon S3

I'm trying to load some data from an Amazon S3 bucket like this:

SparkConf sparkConf = new SparkConf().setAppName("Importer");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(ctx.sc());

DataFrame magento = sqlContext.read().json("https://s3.eu-central-1.amazonaws.com/*/*.json");

The last line, however, throws an error:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: https

The same line worked in another project. What am I missing? I'm running Spark on a Hortonworks CentOS VM.

lte__ asked Sep 06 '16 18:09


1 Answer

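By default, Spark can read from HDFS, S3 and the local file system. There is no FileSystem implementation registered for the https scheme, which is why the exception is thrown. S3 is accessed through the s3a:// or s3n:// schemes instead (see "difference between s3a, s3n and s3 protocols").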

So the best way to access a file is with a URI of the following form:

s3a://bucket-name/key
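To make the mapping concrete, here is a minimal sketch of turning a path-style HTTPS URL like the one in the question into the s3a:// form. `S3aUri` and `fromPathStyleUrl` are hypothetical names made up for this illustration, not part of Spark or Hadoop, and `my-bucket/exports/orders.json` is a placeholder. It assumes path-style addressing, where the bucket is the first path segment.

```java
import java.net.URI;

// Hypothetical helper, not part of Spark or Hadoop: rewrites a path-style
// S3 HTTPS URL into the s3a:// form used by Hadoop's S3A connector.
final class S3aUri {

    static String fromPathStyleUrl(String httpsUrl) {
        URI uri = URI.create(httpsUrl);
        String host = uri.getHost();
        // Assumption: path-style addressing, i.e.
        // https://s3.<region>.amazonaws.com/<bucket>/<key>
        if (!"https".equals(uri.getScheme()) || host == null
                || !host.startsWith("s3.") || !host.endsWith(".amazonaws.com")) {
            throw new IllegalArgumentException("Not a path-style S3 URL: " + httpsUrl);
        }
        String path = uri.getRawPath();      // "/<bucket>/<key>"
        int slash = path.indexOf('/', 1);    // separator between bucket and key
        if (slash < 2 || slash == path.length() - 1) {
            throw new IllegalArgumentException("URL has no bucket/key: " + httpsUrl);
        }
        String bucket = path.substring(1, slash);
        String key = path.substring(slash + 1);
        return "s3a://" + bucket + "/" + key;
    }

    public static void main(String[] args) {
        // Placeholder bucket and key, for illustration only.
        System.out.println(fromPathStyleUrl(
                "https://s3.eu-central-1.amazonaws.com/my-bucket/exports/orders.json"));
    }
}
```

The resulting string is what you would pass to sqlContext.read().json(...) in place of the https URL. Credentials and the S3A libraries still have to be available to Spark for the read to succeed.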

Depending on your Spark version and the libraries bundled with it, you may need to add external JARs (for s3a, typically hadoop-aws and a matching AWS SDK JAR). See:

Spark read file from S3 using sc.textFile ("s3n://...)

(Are you sure the previous project was reading from S3 over https? Maybe it included custom code or JARs that added support for the https scheme.)

Piotr Reszke answered Nov 20 '22 01:11