I have an error "java.io.FileNotFoundException: No such file or directory" while trying to create a dynamic frame using a notebook in AWS Glue

Tags: amazon-s3, pyspark, etl, aws-glue

I'm setting up a Jupyter notebook on an AWS Glue dev endpoint to test some code for an ETL script. I already created a basic ETL script in AWS Glue but, for some reason, when I run the code in the Jupyter notebook, I keep getting a FileNotFoundException.

I'm using a table in the Data Catalog that was created by an AWS Glue crawler to fetch the information associated with an S3 bucket. I can get the filenames inside the bucket, but when I try to read the file through the dynamic frame, a FileNotFoundException is thrown.

Has anyone ever had this issue before?

This is running in an AWS account in the N. Virginia region. I've already set up the permissions, granted IAM roles to the AWS Glue service, set up the VPC endpoints, and tried running the job directly in AWS Glue, to no avail.

This is the basic code:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "xxx-database", table_name = "mytable_item", transformation_ctx = "datasource0")

datasource0.printSchema()
datasource0.show()

Alternatively:

datasource0 = glueContext.create_dynamic_frame.from_options('s3', connection_options={"paths":["s3://my-bucket/92387979/My-Table-Item/2016-09-11T16:30:00.000Z/"]}, format="json", transformation_ctx="")


datasource0.printSchema()
datasource0.show()

I expected to get the contents of the dynamic frame, but instead this error is thrown:

An error occurred while calling o343.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 23, ip-172-31-87-88.ec2.internal, executor 6): java.io.FileNotFoundException: No such file or directory 's3://my-bucket/92387979/My-Table-Item/2016-09-11T16:30:00.000Z/92387979-My-Table-Item-2016-09-11T16:30:00.000Z.json.gz'
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:826)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1206)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
    at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.initialize(TapeHadoopRecordReader.scala:99)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:182)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

Thanks in advance for any help given.

asked Jul 09 '19 18:07

hgpestana

1 Answer

Well, as Chris D'Englere and Harsh Bafna pointed out, it was indeed a permissions issue. As it turns out, I had granted S3 permissions on the bucket itself but forgot to add the permission (GetObject) for the objects inside it.
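
For illustration, here is a minimal sketch of the bucket-versus-object distinction described above. It is not the exact policy used here; the helper name `build_s3_read_policy` and the statement IDs are made up for the example, and `my-bucket` is the placeholder bucket name from the question. The point is that `s3:ListBucket` applies to the bucket ARN, while `s3:GetObject` applies to the object ARNs (`bucket/*`), so a role can have the first without the second.

```python
import json


def build_s3_read_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document (hypothetical helper) that allows
    listing the bucket and reading the objects under `prefix`."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    # Object-level actions need a resource that matches object keys,
    # not the bucket ARN itself.
    object_arn = f"{bucket_arn}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level permission: listing the keys.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level permission: reading the files.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [object_arn],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_s3_read_policy("my-bucket"), indent=2))
```

The resulting JSON could then be attached to the IAM role that the Glue job or dev endpoint assumes. If the job also writes to the bucket, it would need further object-level actions that this sketch does not include.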

Thanks for the help!

answered Sep 20 '22 22:09

hgpestana
