I'm trying to load some data from an Amazon S3 bucket by:
SparkConf sparkConf = new SparkConf().setAppName("Importer");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(ctx.sc());
DataFrame magento = sqlContext.read().json("https://s3.eu-central-1.amazonaws.com/*/*.json");
The last line, however, throws an error:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: https
The same line worked in another project; what am I missing? I'm running Spark on a Hortonworks CentOS VM.
By default Spark supports HDFS, S3, and the local file system; there is no built-in FileSystem implementation for the https scheme, which is exactly what the error is telling you. S3 should instead be accessed through the s3a:// or s3n:// protocols (see: difference between s3a, s3n and s3 protocols).
So the best way to access a file is:
s3a://bucket-name/key
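Applied to your code, the fix is a one-line change to the path. A minimal sketch (the bucket name and key pattern are placeholders, and the credential properties are only needed if your keys aren't picked up from the environment or an instance profile):

SparkConf sparkConf = new SparkConf().setAppName("Importer");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(ctx.sc());

// Optional: supply AWS credentials via the Hadoop configuration
ctx.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
ctx.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

// s3a:// instead of https:// — Hadoop resolves the FileSystem by scheme
DataFrame magento = sqlContext.read().json("s3a://bucket-name/*.json");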
Depending on your Spark version and the libraries on your classpath, you may also need to add external jars:
Spark read file from S3 using sc.textFile ("s3n://...)
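For example, the s3a:// filesystem lives in the hadoop-aws module, which is not always bundled with Spark. A sketch of the Maven dependency (the version shown is an assumption — match it to the Hadoop version your Spark build uses):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>2.7.3</version>
</dependency>

Alternatively, pass the jar at submit time with --jars or spark.jars.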
(Are you sure you were reading from S3 over https in the previous project? Perhaps it included custom code or extra jars that added support for the https scheme?)