After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and tried to rerun the job. I received the following error when trying to write Parquet output to S3 using sqlContext.write:
'bucket/folder' present in the metadata but not s3
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:455)
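For reference, the failing write is just a DataFrame Parquet write to that S3 prefix. A minimal sketch of that kind of write (the session setup, sample data, and path here are placeholders rather than the actual job):

from pyspark.sql import SparkSession

# On EMR, s3:// paths go through EMRFS, which is where the consistency check runs.
spark = SparkSession.builder.appName("emrfs-write-example").getOrCreate()

# Placeholder DataFrame standing in for the real job output.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Writing Parquet back to the same S3 prefix whose objects were deleted by hand.
df.write.mode("overwrite").parquet("s3://bucket/folder")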
I tried running
emrfs sync s3://bucket/folder
which did not appear to resolve the error, even though it did remove some records from the DynamoDB table that keeps track of the metadata. I'm not sure what else to try. How do I resolve this error?
EMRFS creates a consistent view of objects in Amazon S3 by adding information about those objects to the EMRFS metadata. EMRFS adds these listings to its metadata when an object is written by EMRFS during the course of an Amazon EMR job, or when an object is synced with or imported to the EMRFS metadata by using the EMRFS CLI.
EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like data encryption.
EMRFS consistent view uses a DynamoDB table to track objects in Amazon S3 that have been synced with or created by EMRFS. The metadata is used to track all operations (read, write, update, and copy); no actual content is stored in it.
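If you want to see what consistent view is actually tracking, you can inspect that DynamoDB table directly. A minimal boto3 sketch; it assumes the default table name EmrFSMetadata (configurable via fs.s3.consistent.metadata.tableName) and uses a placeholder region:

import boto3

# Region is a placeholder; use the region your cluster and metadata table live in.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Item count and table size give a rough idea of how many objects are tracked.
desc = dynamodb.describe_table(TableName="EmrFSMetadata")["Table"]
print(desc["ItemCount"], "tracked entries,", desc["TableSizeBytes"], "bytes")

# Peek at a few raw metadata records (no S3 object content is stored here).
page = dynamodb.scan(TableName="EmrFSMetadata", Limit=5)
for item in page.get("Items", []):
    print(item)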
Keep in mind that EMRFS is backed by Amazon S3, which is an object store rather than a file system; see the Hadoop documentation on object stores vs. filesystems and the Amazon EMR guidance on working with storage and file systems.
It turned out that I needed to run
emrfs delete s3://bucket/folder
before running sync. Running the delete and then the sync resolved the issue.
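If you want to script that same fix, for example from a driver script on the master node where the EMRFS CLI is available, a small sketch using Python's subprocess with the same two commands (the path is a placeholder):

import subprocess

# Delete the stale metadata entries for the prefix, then resync with S3.
subprocess.run(["emrfs", "delete", "s3://bucket/folder"], check=True)
subprocess.run(["emrfs", "sync", "s3://bucket/folder"], check=True)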
Mostly, this consistency problem comes from the retry logic in Spark and Hadoop. When a process that creates a file on S3 fails, the corresponding entry may already have been written to DynamoDB. When Hadoop retries the operation, the entry is already present in DynamoDB but the object is not in S3, so it throws the consistency error.
If you want to delete the S3 metadata stored in DynamoDB for objects that have already been removed, these are the steps.
Delete all the metadata entries under the path. emrfs delete uses a hash function to find the records to remove, so it may also delete entries other than the intended ones; that is why we do the import and sync in the subsequent steps:
emrfs delete s3://path
Import the metadata for the objects that are physically present in S3 back into DynamoDB:
emrfs import s3://path
Sync the data between S3 and the metadata:
emrfs sync s3://path
After all these operations, to check whether a particular object is present in both S3 and the metadata, run:
emrfs diff s3://path
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-cli-reference.html
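As a final check independent of the EMRFS CLI, you can list what is physically under the prefix in S3 with boto3; the bucket and prefix below are placeholders for the real path:

import boto3

# List the objects that actually exist under the prefix in S3
# (first page only, up to 1000 keys; enough to see whether anything is left).
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="bucket", Prefix="folder/")

for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

if resp["KeyCount"] == 0:
    print("Nothing under this prefix in S3; any remaining entries exist only in the EMRFS metadata.")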