
Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found (Spark 1.6 Windows)

I am trying to access S3 files from a local Spark context using PySpark. I keep getting:

```
File "C:\Spark\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
```

I set os.environ['AWS_ACCESS_KEY_ID'] and os.environ['AWS_SECRET_ACCESS_KEY'] before calling df = sqc.read.parquet(input_path). I also added these lines:

```python
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])
hadoopConf.set("fs.s3.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
```

I have also tried changing s3 to s3n and s3a. None of them worked.

Any idea how to make it work? I am on Windows 10 with PySpark, Spark 1.6.1 built for Hadoop 2.6.0.
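For reference, the configuration described above can be sketched as follows. This is a minimal sketch, not the asker's exact code: the helper names `s3n_settings` and `apply_settings` are hypothetical, and it assumes Spark 1.6 with the hadoop-aws classes on the classpath. Note that `NativeS3FileSystem` is registered under the `fs.s3n.*` keys, which must match the `s3n://` scheme used in the input path.

```python
def s3n_settings(access_key, secret_key):
    """Build the Hadoop settings needed for the s3n:// filesystem.

    (Hypothetical helper for illustration; key names are the standard
    Hadoop 2.x properties for NativeS3FileSystem.)
    """
    return {
        # NativeS3FileSystem is keyed as fs.s3n.impl, not fs.s3.impl
        "fs.s3n.impl": "org.apache.hadoop.fs.s3native.NativeS3FileSystem",
        "fs.s3n.awsAccessKeyId": access_key,
        "fs.s3n.awsSecretAccessKey": secret_key,
    }


def apply_settings(hadoop_conf, settings):
    """Apply each key/value pair to a Hadoop Configuration-like object."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)


# In a real Spark 1.6 session it would be used roughly like this:
#
#   sc = SparkContext()
#   apply_settings(sc._jsc.hadoopConfiguration(),
#                  s3n_settings(os.environ["AWS_ACCESS_KEY_ID"],
#                               os.environ["AWS_SECRET_ACCESS_KEY"]))
#   df = SQLContext(sc).read.parquet("s3n://bucket/path")  # note s3n://
```

Setting the properties is not enough on its own: the class itself ships in the hadoop-aws jar, so if that jar is not on the driver classpath the `ClassNotFoundException` persists regardless of configuration.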

Hanan Shteingart asked May 06 '16
1 Answer

I'm running pyspark with the hadoop-aws libraries appended.

You will need to use s3n in your input path. I'm running this from macOS, so I'm not sure whether it will work on Windows.

```shell
$SPARK_HOME/bin/pyspark --packages org.apache.hadoop:hadoop-aws:2.7.1
```
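If you launch PySpark from a script rather than the shell, the same dependency can be requested through the `PYSPARK_SUBMIT_ARGS` environment variable before the SparkContext is created. This is a sketch, not the answerer's code; the `submit_args` helper is hypothetical, and the `2.7.1` version is an assumption that should match the Hadoop build your Spark was compiled against.

```python
import os


def submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value that pulls extra Maven packages.

    (Hypothetical helper; the trailing 'pyspark-shell' token is required
    so that spark-submit still starts the Python shell backend.)
    """
    return "--packages {} pyspark-shell".format(",".join(packages))


# Must be set BEFORE the SparkContext is constructed; the jars are
# resolved from Maven and added to the classpath at startup.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args(
    ["org.apache.hadoop:hadoop-aws:2.7.1"])
```

Either way, the effect is the same as the command-line flag above: hadoop-aws (which contains `NativeS3FileSystem`) lands on the driver and executor classpath, so the `ClassNotFoundException` goes away.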
Franzi answered Sep 16 '22