
Locally reading S3 files through Spark (or better: pyspark)

I want to read an S3 file from my (local) machine, through Spark (pyspark, really). Now, I keep getting authentication errors like

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

I looked everywhere here and on the web, tried many things, but apparently S3 has been changing over the last year or months, and all methods failed but one:

pyspark.SparkContext().textFile("s3n://user:password@bucket/key") 

(note the s3n [s3 did not work]). Now, I don't want to use a URL with the user and password because they can appear in logs, and I am also not sure how to get them from the ~/.aws/credentials file anyway.

So, how can I read locally from S3 through Spark (or, better, pyspark) using the AWS credentials from the now standard ~/.aws/credentials file (ideally, without copying the credentials there to yet another configuration file)?

PS: I tried os.environ["AWS_ACCESS_KEY_ID"] = … and os.environ["AWS_SECRET_ACCESS_KEY"] = …, it did not work.

PPS: I am not sure where to "set the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties" (Google did not come up with anything). However, I did try many ways of setting these: SparkContext.setSystemProperty(), sc.setLocalProperty(), and conf = SparkConf(); conf.set(…); conf.set(…); sc = SparkContext(conf=conf). Nothing worked.

Eric O Lebigot asked Apr 04 '15

People also ask

Can Spark access S3 data?

If you are using PySpark to access S3 buckets, you must pass the Spark engine the right packages to use, specifically aws-java-sdk and hadoop-aws. It is important to identify the right package versions: as of this writing, aws-java-sdk 1.7.4 pairs with hadoop-aws 2.7.x.
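One way to pass those packages is at launch time via the `--packages` flag; a minimal sketch (the version number here is an assumption, pick the hadoop-aws version matching the Hadoop libraries your Spark build ships with):

```shell
# Launch pyspark with the S3 connector on the classpath.
# hadoop-aws 2.7.3 transitively pulls in the matching aws-java-sdk 1.7.4;
# adjust the version to match your Spark distribution's Hadoop build.
pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3
```

The same flag works with `spark-submit`.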

Does Spark support Amazon S3?

With Amazon EMR release version 5.17.0 and later, you can use S3 Select with Spark on Amazon EMR.

How do I read a CSV file from S3 bucket in PySpark?

With spark.read.format("csv").load("path") you can read a CSV file from Amazon S3 into a Spark DataFrame; the method takes the file path to read as an argument. By default, the read treats the header row as an ordinary data record, so the column names in the file are read in as data. To avoid this, explicitly set the header option to "true", e.g. spark.read.format("csv").option("header", "true").load("path").


1 Answer

Yes, you have to use s3n instead of s3. The s3 scheme is Hadoop's older block-based filesystem layered on top of S3, a weird abuse of S3 the benefits of which are unclear to me.

You can pass the credentials to the sc.hadoopFile or sc.newAPIHadoopFile calls:

Note that sc.hadoopFile also requires the input format and key/value classes; for plain text files:

rdd = sc.hadoopFile(
    's3n://my_bucket/my_file',
    'org.apache.hadoop.mapred.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={
        'fs.s3n.awsAccessKeyId': '...',
        'fs.s3n.awsSecretAccessKey': '...',
    },
)
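To pick the keys up from the standard ~/.aws/credentials file instead of hard-coding them, you can parse that file yourself (it is plain INI format) and build the conf dict from it. A small sketch; the helper name is mine, and the hadoopFile call is shown as a comment because it needs a live SparkContext:

```python
import configparser
import os


def s3n_conf_from_credentials(profile="default", path=None):
    """Build the fs.s3n.* conf dict from the standard AWS credentials file."""
    path = path or os.path.expanduser("~/.aws/credentials")
    parser = configparser.ConfigParser()
    parser.read(path)
    section = parser[profile]
    return {
        "fs.s3n.awsAccessKeyId": section["aws_access_key_id"],
        "fs.s3n.awsSecretAccessKey": section["aws_secret_access_key"],
    }


# With a live SparkContext sc:
# rdd = sc.hadoopFile('s3n://my_bucket/my_file',
#                     'org.apache.hadoop.mapred.TextInputFormat',
#                     'org.apache.hadoop.io.LongWritable',
#                     'org.apache.hadoop.io.Text',
#                     conf=s3n_conf_from_credentials())
```

This keeps the keys out of the URL (and thus out of the logs), and out of any extra configuration file.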
Daniel Darabos answered Sep 22 '22