I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
I looked everywhere here and on the web and tried many things, but apparently S3 has been changing over the past year or so, and all methods failed but one:
pyspark.SparkContext().textFile("s3n://user:password@bucket/key")
(note the s3n; s3 did not work). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.
So, how can I read locally from S3 through Spark (or, better, PySpark) using the AWS credentials from the now-standard ~/.aws/credentials file (ideally, without copying the credentials into yet another configuration file)?
PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …; it did not work.
PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting them: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.
If you are using PySpark to access S3 buckets, you must pass the Spark engine the right packages to use, specifically aws-java-sdk and hadoop-aws. It is important to identify the right package versions; as of this writing, aws-java-sdk 1.7.4 and hadoop-aws 2.7.x are a combination known to work together.
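A hedged sketch of one way to pull those packages in from Python before the JVM starts (the hadoop-aws version here is an assumption and must match the Hadoop build your Spark was compiled against; hadoop-aws 2.7.x transitively brings in aws-java-sdk 1.7.4):

import os

# Must be set before the SparkContext starts the JVM; the trailing
# "pyspark-shell" is required when launching from plain Python.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.hadoop:hadoop-aws:2.7.4 pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext(appName="s3-read")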
With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR.
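For reference, a minimal sketch of what that looks like, assuming you are on such an EMR cluster with its S3 Select connector (the s3selectCSV format name comes from the EMR documentation; the bucket path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3select-example").getOrCreate()

# EMR-provided data source; use "s3selectJSON" for JSON input instead.
df = spark.read.format("s3selectCSV").load("s3://my_bucket/my_prefix/")
df.show(5)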
format("csv"). load("path") you can read a CSV file from Amazon S3 into a Spark DataFrame, Thes method takes a file path to read as an argument. By default read method considers header as a data record hence it reads column names on file as data, To overcome this we need to explicitly mention “true” for header option.
Yes, you have to use s3n instead of s3. s3 is some weird abuse of S3 the benefits of which are unclear to me. You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls:
rdd = sc.hadoopFile(
    's3n://my_bucket/my_file',
    'org.apache.hadoop.mapred.TextInputFormat',  # input format class (required)
    'org.apache.hadoop.io.LongWritable',         # key class: byte offset of each line
    'org.apache.hadoop.io.Text',                 # value class: the line itself
    conf={'fs.s3n.awsAccessKeyId': '...', 'fs.s3n.awsSecretAccessKey': '...'},
)
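A small usage sketch for completeness: with TextInputFormat the RDD holds (offset, line) pairs, so dropping the keys gives an RDD of lines like textFile would return.

# Keys are byte offsets from TextInputFormat; keep only the line text.
lines = rdd.map(lambda kv: kv[1])
print(lines.take(5))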