
Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

I am using an EMR Activity in AWS Data Pipeline. The EMR Activity runs a Hive script on an EMR cluster. It takes a DynamoDB table as input and stores the output in S3.

This is the EMR step used in the EMR Activity:

s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-s3-bucket/hive/my_hive_script.q,-d,DYNAMODB_INPUT_TABLE1=MyTable,-d,S3_OUTPUT_BUCKET=#{output.directoryPath}

where

output.directoryPath is:

s3://my-s3-bucket/output/#{format(@scheduledStartTime,"YYYY-MM-dd")}

So this creates one folder and one file in S3 (technically speaking, it creates two keys: 2017-03-18/<some_random_number> and 2017-03-18_$folder$):

2017-03-18
2017-03-18_$folder$

How can I avoid the creation of these extra empty _$folder$ files?

EDIT: I found a solution listed at https://issues.apache.org/jira/browse/HADOOP-10400, but I don't know how to implement it in AWS Data Pipeline.

asked Mar 18 '17 by saurabh agarwal


3 Answers

There's no way in S3 to actually create an empty folder. S3 is an object store, so everything in it is an object.

When Hadoop uses S3 as a filesystem, it needs to organize those objects so that they appear as a filesystem tree, so it creates special marker objects to represent directories.

You just store data files, but you can choose to organize those data files into paths, which creates a concept similar to folders for traversal.

Some tools, including the AWS Management Console, mimic folders by interpreting the /s in object names. The Amazon S3 console supports the folder concept as a means of grouping objects, and so does Bucket Explorer.

If you simply don't create a folder but instead place files at the path you want, that should work for you.

You don't have to create a folder before writing files to it in S3, because /all/path/including/filename is one whole key in S3 storage.
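A minimal sketch of that point, with a hypothetical key layout: an S3 "path" is just a single key string, so no directory object has to exist before a file is written under it.

```python
# The bucket and key names here are hypothetical examples.

def s3_key_for(*parts: str) -> str:
    """Join path components into one S3 object key string."""
    return "/".join(parts)

key = s3_key_for("output", "2017-03-18", "part-00000")
# key == "output/2017-03-18/part-00000"
#
# Uploading this key (e.g. with boto3's put_object) needs no prior
# folder creation; the console renders output/ and 2017-03-18/ as
# folders purely by splitting the key on "/".
```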

answered Oct 22 '22 by leftjoin


Use s3a:// when writing to the S3 bucket; it will not create the $folder$ markers. I have tested this in Glue; I'm not sure whether it applies to EMR clusters.

Credit: answered by someone on Reddit.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read via the s3:// scheme, but write through s3a://,
# which does not leave _$folder$ marker objects behind.
df = spark.read.format("parquet").load("s3://testingbucket/")
df.write.format("parquet").save("s3a://testingbucket/parttest/")

spark.stop()
answered Oct 22 '22 by Pruthvi Raj


EMR doesn't seem to provide a way to avoid this.

Because S3 uses a key-value pair storage system, the Hadoop file system implements directory support in S3 by creating empty files with the "_$folder$" suffix.

You can safely delete any empty files with the <directoryname>_$folder$ suffix that appear in your S3 buckets. These empty files are created by the Hadoop framework at runtime, and Hadoop is designed to process data even after they are removed.

https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/

The behavior comes from the Hadoop source code, so it could be fixed there, but apparently it hasn't been fixed in EMR.

If you are feeling clever, you could create an S3 event notification that matches the _$folder$ suffix and have it fire a Lambda function to delete those objects right after they're created.
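A hedged sketch of that Lambda idea: the bucket wiring, the suffix-filtered event notification, and the handler name are all assumptions you would configure yourself, not anything EMR provides out of the box.

```python
# Sketch of a Lambda handler that deletes Hadoop's "_$folder$" marker
# objects when an S3 ObjectCreated event notification fires.
# Assumes the bucket's notification is filtered on the "_$folder$"
# suffix; event structure follows the standard S3 notification format.

def is_folder_marker(key: str) -> bool:
    """True for Hadoop's empty directory-marker objects."""
    return key.endswith("_$folder$")

def handler(event, context):
    # boto3 is imported lazily here so the marker check above can be
    # exercised without AWS dependencies installed.
    import boto3
    s3 = boto3.client("s3")
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if is_folder_marker(key):
            s3.delete_object(Bucket=bucket, Key=key)
```

Note this only cleans up after the fact; the markers are still briefly created by Hadoop before the function deletes them.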

answered Oct 22 '22 by Michael - sqlbot