
How to read input from S3 in a Spark Streaming EC2 cluster application

I'm trying to make my Spark Streaming application read its input from an S3 directory, but I keep getting this exception after launching it with the spark-submit script:

Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
    at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at org.apache.hadoop.fs.s3native.$Proxy6.initialize(Unknown Source)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:216)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
    at org.apache.spark.streaming.StreamingContext.checkpoint(StreamingContext.scala:195)
    at MainClass$.main(MainClass.scala:1190)
    at MainClass.main(MainClass.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I'm setting those properties through this block of code, as suggested here http://spark.apache.org/docs/latest/ec2-scripts.html (bottom of the page):

val ssc = new org.apache.spark.streaming.StreamingContext(
  conf,
  Seconds(60))
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", args(2))
ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", args(3))

args(2) and args(3) are my AWS Access Key ID and Secret Access Key, of course.

Why does it keep saying they are not set?

EDIT: I also tried it this way, but I get the same exception:

val lines = ssc.textFileStream("s3n://"+ args(2) +":"+ args(3) + "@<mybucket>/path/") 
asked Jun 04 '14 by gprivitera

1 Answer

Odd. Try also doing a .set on the sparkContext (a sketch of that is below). Also try exporting the env variables before you start the application:

export AWS_ACCESS_KEY_ID=<your access>
export AWS_SECRET_ACCESS_KEY=<your secret>

^^this is how we do it.
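
For the .set route mentioned above, here's a minimal sketch, assuming Spark's spark.hadoop.* pass-through (properties with that prefix get copied into the Hadoop configuration) and an s3n:// path; the app name and bucket path are placeholders, not from the question:

// Minimal sketch, not verbatim from this answer: put the s3n credentials on the
// SparkConf via the spark.hadoop.* prefix, then build the streaming context from it.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("MyStreamingApp")  // placeholder name
  .set("spark.hadoop.fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
  .set("spark.hadoop.fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val ssc = new StreamingContext(conf, Seconds(60))
val lines = ssc.textFileStream("s3n://my-bucket/path/")  // placeholder bucket/path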

UPDATE: According to @tribbloid, the above broke in 1.3.0; now you have to faff around for ages with hdfs-site.xml, or you can do this (and it works in a spark-shell):

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)
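
With those properties in place, reading from the bucket in the same session should work; a hypothetical usage (the bucket and path are placeholders):

// Hypothetical usage in the same spark-shell session; bucket/path are made up.
val lines = sc.textFile("s3://my-bucket/some/path/")
lines.take(5).foreach(println)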
answered Oct 11 '22 by samthebest