I am trying to access S3 files from a local Spark context using pySpark. I keep getting this error:
File "C:\Spark\python\lib\py4j-0.9-src.zip\py4j\protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
I had set os.environ['AWS_ACCESS_KEY_ID'] and os.environ['AWS_SECRET_ACCESS_KEY'] before I called df = sqc.read.parquet(input_path). I also added these lines:
# point the s3 scheme at the native S3 filesystem and pass the credentials from the environment
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])
hadoopConf.set("fs.s3.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
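For context, the surrounding setup in my script looks roughly like this (the bucket path below is just a placeholder for my real input_path):

import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# local Spark context and SQL context (Spark 1.6 API)
sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("s3-parquet-test"))
sqc = SQLContext(sc)

# the Hadoop configuration object used for the fs.s3 settings above
hadoopConf = sc._jsc.hadoopConfiguration()
# ... the three hadoopConf.set(...) calls shown above go here ...

df = sqc.read.parquet("s3n://some-bucket/some/prefix")  # placeholder for input_path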
I have also tried changing s3 to s3n and s3a; neither worked.
Any idea how to make this work? I am on Windows 10, with pySpark and Spark 1.6.1 built for Hadoop 2.6.0.
I'm running pyspark with the hadoop-aws libraries added. You will need to use s3n in your input path. I'm running this on macOS, so I'm not sure whether it will work on Windows.
$SPARK_HOME/bin/pyspark --packages org.apache.hadoop:hadoop-aws:2.7.1
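After launching the shell that way, a minimal sketch of what I do (the bucket and path are placeholders, and I pass the credentials through the Hadoop configuration) looks like this:

import os
from pyspark.sql import SQLContext

# `sc` is the SparkContext that the pyspark shell already provides
sqc = SQLContext(sc)
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3n.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
hadoopConf.set("fs.s3n.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])

df = sqc.read.parquet("s3n://some-bucket/some/prefix")  # note the s3n:// scheme
df.show(5)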