EMRFS file sync with S3 not working

After running a Spark job on an Amazon EMR cluster, I deleted the output files directly from S3 and tried to rerun the job. I received the following error when trying to write Parquet output to S3 using sqlContext.write:

'bucket/folder' present in the metadata but not s3
at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:455)

I tried running

emrfs sync s3://bucket/folder

which did not appear to resolve the error, even though it did remove some records from the DynamoDB table that keeps track of the metadata. I'm not sure what else to try. How do I resolve this error?

sakurashinken asked Oct 03 '16 01:10



2 Answers

It turned out that I needed to run

emrfs delete s3://bucket/folder

before running sync. That solved the issue.

sakurashinken answered Oct 24 '22 13:10


The consistency problem is mostly caused by retry logic in Spark and Hadoop. When a process that creates a file on S3 fails after its entry has already been written to DynamoDB, the restarted process finds that entry in DynamoDB without a matching object in S3 and throws the consistency error.

To delete the metadata stored in DynamoDB for S3 objects that have already been removed, follow these steps.

Delete all the metadata for the path. emrfs delete uses a hash function to delete the records, so it may also delete unwanted entries; that is why the import and sync are run in the subsequent steps.

emrfs delete s3://path

Import the metadata for the objects that are physically present in S3 into DynamoDB.

emrfs import s3://path

Sync the metadata with S3.

emrfs sync s3://path

After all these operations, check whether the objects are present in both S3 and the metadata.

emrfs diff s3://path

http://docs.aws.amazon.com/emr/latest/ManagementGuide/emrfs-cli-reference.html
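
The four steps above can be chained in a small script. This is only a sketch, assuming it runs on the EMR master node where the emrfs CLI is on the PATH; the function name reset_emrfs_metadata and the EMRFS_BIN variable are made up for illustration and are not part of the emrfs tool.

```shell
#!/bin/sh
# Sketch: rebuild the EMRFS metadata for one S3 path.
# EMRFS_BIN is a hypothetical override; it defaults to the real emrfs CLI.
EMRFS_BIN="${EMRFS_BIN:-emrfs}"

# reset_emrfs_metadata is a hypothetical helper name.
reset_emrfs_metadata() {
  target="$1"
  if [ -z "$target" ]; then
    echo "usage: reset_emrfs_metadata s3://path" >&2
    return 2
  fi
  # 1. Drop the metadata entries for the path.
  "$EMRFS_BIN" delete "$target" || return 1
  # 2. Re-import metadata for objects that actually exist in S3.
  "$EMRFS_BIN" import "$target" || return 1
  # 3. Sync the metadata with S3.
  "$EMRFS_BIN" sync "$target" || return 1
  # 4. Show any remaining differences between S3 and the metadata.
  "$EMRFS_BIN" diff "$target"
}
```

Call it as reset_emrfs_metadata s3://path. The function stops at the first failing step, so a failed delete is not followed by an import. Setting EMRFS_BIN=echo prints the commands instead of running them, which is handy for a dry run.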

loneStar answered Oct 24 '22 13:10