I'm setting up a new Jupyter Notebook on an AWS Glue dev endpoint in order to test some code for an ETL script. So far I have created a basic ETL script using AWS Glue but, for some reason, when I run the code in the Jupyter Notebook, I keep getting a FileNotFoundException.
I'm using a table (in the Data Catalog) that was created by an AWS Glue crawler to fetch the information associated with an S3 bucket. I'm able to get the file names inside the bucket, but when I try to read the file using the dynamic frame, a FileNotFoundException is thrown.
Has anyone ever had this issue before?
This is running in an AWS account in the N. Virginia region. I've already set up the permissions, granted IAM roles to the AWS Glue service, set up the VPC endpoints and tried running the job directly in AWS Glue, to no avail.
This is the basic code:
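from pyspark.context import SparkContext
from awsglue.context import GlueContext
# Create the Glue context on top of the notebook's Spark context
glueContext = GlueContext(SparkContext.getOrCreate())
# Read the crawler-created table from the Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="xxx-database", table_name="mytable_item", transformation_ctx="datasource0")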
datasource0.printSchema()
datasource0.show()
Alternatively:
datasource0 = glueContext.create_dynamic_frame.from_options('s3', connection_options={"paths":["s3://my-bucket/92387979/My-Table-Item/2016-09-11T16:30:00.000Z/"]}, format="json", transformation_ctx="")
datasource0.printSchema()
datasource0.show()
I would expect to get the contents of the dynamic frame back, but instead this error is thrown:
An error occurred while calling o343.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 23, ip-172-31-87-88.ec2.internal, executor 6): java.io.FileNotFoundException: No such file or directory 's3://my-bucket/92387979/My-Table-Item/2016-09-11T16:30:00.000Z/92387979-My-Table-Item-2016-09-11T16:30:00.000Z.json.gz'
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:826)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1206)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.initialize(TapeHadoopRecordReader.scala:99)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:182)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
Thanks in advance for any help given.
java.io.FileNotFoundException is a common exception that occurs when we try to access a file. FileNotFoundException is thrown by the constructors of RandomAccessFile, FileInputStream, and FileOutputStream.
Try to create a file in a subfolder, for example, C:/somedir/somefile.txt instead of the root. If the file is being opened exclusively by some other process, opening it for either reading or writing will cause java.io.FileNotFoundException (Access is denied) exception. Make sure that the file is not opened by any other program or process.
Finally, the aforementioned exception can occur when the requested file exists but is already opened by another application.
If AWS Glue fails to create the notebook server for a development endpoint, it might be because of one of the following problems: AWS Glue passes an IAM role to Amazon EC2 when it is setting up the notebook server. The IAM role must have a trust relationship to Amazon EC2. The IAM role must have an instance profile of the same name.
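As a rough illustration of the trust relationship mentioned above, the following sketch builds the kind of trust policy document that lets Amazon EC2 assume a role. The function name `build_ec2_trust_policy` is hypothetical and only for illustration; the actual role and instance profile used by your notebook server have to be checked in IAM.

```python
import json


def build_ec2_trust_policy():
    """Return a trust policy document that allows Amazon EC2 to assume the role.

    Hypothetical helper for illustration; it only builds the JSON structure
    and does not call AWS.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The EC2 service principal is what "trust relationship to Amazon EC2" refers to
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_ec2_trust_policy(), indent=2))
```

The printed JSON can be compared against the "Trust relationships" tab of the role in the IAM console.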
Well, as Chris D'Englere and Harsh Bafna pointed out, it was indeed a permissions issue. As it turns out, I forgot to add specific S3 permissions (GetObject) for the objects inside the bucket and not only for the bucket itself.
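For anyone hitting the same thing, here is a minimal sketch of what that difference looks like in an IAM policy. It is not the exact policy I used: the helper name `build_s3_read_policy` is made up for illustration, and `my-bucket` is the placeholder bucket name from the question. The point is that `s3:ListBucket` applies to the bucket ARN, while `s3:GetObject` applies to the object ARNs (`bucket/*`), so a role with only the first can list file names but fails when reading them.

```python
import json


def build_s3_read_policy(bucket_name):
    """Return an IAM policy document with bucket-level and object-level read access.

    Hypothetical helper for illustration; it only builds the JSON structure
    and does not call AWS.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level permission: enough to list the file names
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Object-level permission: needed to actually read the files
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_s3_read_policy("my-bucket"), indent=2))
```

The resulting JSON would be attached to the role that the Glue job or dev endpoint runs as; adjust the bucket name and narrow the object prefix as needed.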
Thanks for the help!