
I have an error "java.io.FileNotFoundException: No such file or directory" while trying to create a dynamic frame using a notebook in AWS Glue

I'm setting up a Jupyter notebook on an AWS Glue development endpoint to test code for an ETL script. I've created a basic ETL script in AWS Glue, but whenever I run the code in the notebook, I get a FileNotFoundException.

I'm using a Data Catalog table that was created by an AWS Glue crawler over an S3 bucket. I can list the filenames inside the bucket, but when I try to read a file through the dynamic frame, a FileNotFoundException is thrown.

Has anyone ever had this issue before?

This is running in an AWS account in the N. Virginia (us-east-1) region. I've already set up the permissions, granted IAM roles to the AWS Glue service, set up the VPC endpoints, and tried running the job directly in AWS Glue, all to no avail.

This is the basic code:

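# Standard Glue notebook setup; skip if glueContext is already defined in the session.
from awsglue.context import GlueContext
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
datasource0 = glueContext.create_dynamic_frame.from_catalog(database="xxx-database", table_name="mytable_item", transformation_ctx="datasource0")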

datasource0.printSchema()
datasource0.show()

Alternatively:

datasource0 = glueContext.create_dynamic_frame.from_options('s3', connection_options={"paths":["s3://my-bucket/92387979/My-Table-Item/2016-09-11T16:30:00.000Z/"]}, format="json", transformation_ctx="")


datasource0.printSchema()
datasource0.show()

I expected to get the contents of the dynamic frame, but instead this error is thrown:

An error occurred while calling o343.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 23, ip-172-31-87-88.ec2.internal, executor 6): java.io.FileNotFoundException: No such file or directory 's3://my-bucket/92387979/My-Table-Item/2016-09-11T16:30:00.000Z/92387979-My-Table-Item-2016-09-11T16:30:00.000Z.json.gz'
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:826)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1206)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
    at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.initialize(TapeHadoopRecordReader.scala:99)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:182)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

Thanks in advance for any help given.

Asked by hgpestana on Jul 09 '19 at 18:07



1 Answer

Well, as Chris D'Englere and Harsh Bafna pointed out, it was indeed a permissions issue. As it turns out, I had granted S3 permissions on the bucket itself but forgot to grant object-level permissions (GetObject) on the objects inside it.
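
For anyone hitting the same thing, here is a minimal sketch of the difference between bucket-level and object-level resources in an IAM policy. It is not the exact policy I used: the helper name build_s3_read_policy and the Sid values are made up for illustration, and pairing s3:ListBucket with the bucket ARN is my assumption about what the bucket-level grant typically looks like. The key point is that s3:GetObject must target the objects (bucket ARN followed by /*), not the bucket ARN alone.

```python
import json


def build_s3_read_policy(bucket_name):
    """Build an illustrative read-only policy for one bucket (hypothetical helper)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level: allows listing keys, which is why the
                # filenames were visible even though reads failed.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level: the part that was missing. Note the "/*"
                # suffix, which targets the objects rather than the bucket.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


policy = build_s3_read_policy("my-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the role that the Glue job or dev endpoint assumes; narrow the object resource to a prefix if the role should not read the whole bucket.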

Thanks for the help!

Answered by hgpestana on Sep 20 '22 at 22:09