
How to use EMR S3DistCp groupBy properly?

I am using the AWS .NET SDK to run an s3-dist-cp job on EMR to concatenate all the files in a folder with the --groupBy arg. But whatever "groupBy" arg I have tried, it either fails every time or just copies the files without concatenating them, as if no --groupBy were specified in the arg list.

The files in the folder were written by Spark's saveAsTextFile and are named like below:

part-0000
part-0001
part-0002
...
...

// "step" is presumably a StepConfig; its HadoopJarStep property takes this config.
step.HadoopJarStep = new HadoopJarStepConfig
{
    Jar = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    Args = new List<string>
    {
        "--s3Endpoint=s3-eu-west-1.amazonaws.com",
        "--src=s3://foo/spark/result/bar",
        "--dest=s3://foo/spark/result-merged/bar",
        "--groupBy=(part.*)",
        "--targetSize=256"
    }
};
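
For context, here is a minimal sketch of how such a step can be wired up and submitted with the AWS .NET SDK. The cluster id, step name, and region are made-up placeholders, and depending on your SDK version you may need the Async variant of AddJobFlowSteps:

// Minimal sketch; placeholders, not the asker's actual code.
using System.Collections.Generic;
using Amazon;
using Amazon.ElasticMapReduce;
using Amazon.ElasticMapReduce.Model;

var step = new StepConfig
{
    Name = "s3distcp-merge",                     // hypothetical step name
    ActionOnFailure = ActionOnFailure.CONTINUE,  // keep the cluster alive if the step fails
    HadoopJarStep = new HadoopJarStepConfig
    {
        Jar = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        Args = new List<string>
        {
            "--src=s3://foo/spark/result/bar",
            "--dest=s3://foo/spark/result-merged/bar",
            "--groupBy=(part.*)",
            "--targetSize=256"
        }
    }
};

var emr = new AmazonElasticMapReduceClient(RegionEndpoint.EUWest1);
var response = emr.AddJobFlowSteps(new AddJobFlowStepsRequest
{
    JobFlowId = "j-XXXXXXXXXXXXX",               // your running cluster id (placeholder)
    Steps = new List<StepConfig> { step }
});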
asked Jul 14 '16 by Barbaros Alp


People also ask

How does S3DistCp work?

S3DistCp copies data using distributed map-reduce jobs, similar to DistCp. It first runs mappers to compile a list of files to copy to the destination; once the mappers have finished compiling that list, the reducers perform the actual data copy.

What is EMRFS?

The EMR File System (EMRFS) is an implementation of HDFS that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like data encryption.

What is serverless EMR?

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers.


1 Answer

After struggling with this for a whole day, I finally got it working with the --groupBy arg below:

--groupBy=.*part.*(\w+)

But even when I add --targetSize=1024 to the args, s3distcp produces 2.5 MB - 3 MB files. Does anyone have any idea why?

UPDATE

Here is the groupBy clause which concatenates all the files into one file per folder:

.*/(\\w+)/.*

The last "/" is so important here --source="s3://foo/spark/result/"

There are several folders in the "result" folder:

s3://foo/spark/result/foo
s3://foo/spark/result/bar
s3://foo/spark/result/lorem
s3://foo/spark/result/ipsum

and in each folder above there are hundreds of files like:

part-0000
part-0001
part-0002

This .*/(\\w+)/.* groupBy clause groups every file in each folder, so in the end you get one file per folder, named after the folder:

s3://foo/spark/result-merged/foo/foo -> File
s3://foo/spark/result-merged/bar/bar -> File
s3://foo/spark/result-merged/lorem/lorem -> File
s3://foo/spark/result-merged/ipsum/ipsum -> File
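
If it helps to see how the capture group drives the output names, here is a small illustrative C# sketch (my own example, not s3-dist-cp's actual matching code) that applies the same pattern to a few keys; group 1 is what determines the merged file name:

using System;
using System.Text.RegularExpressions;

class GroupByDemo
{
    static void Main()
    {
        // The pattern as the regex engine sees it: .*/(\w+)/.*
        var groupBy = new Regex(@".*/(\w+)/.*");

        string[] keys =
        {
            "spark/result/foo/part-0000",
            "spark/result/bar/part-0001",
            "spark/result/lorem/part-0002"
        };

        foreach (var key in keys)
        {
            var match = groupBy.Match(key);
            // Group 1 ("foo", "bar", "lorem") names the merged output file.
            Console.WriteLine($"{key} -> {match.Groups[1].Value}");
        }
    }
}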

So, this is the final working command for me:

s3-dist-cp --src s3://foo/spark/result/  --dest s3://foo/spark/results-merged --groupBy '.*/(\\w+)/.*' --targetSize 1024
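
And since the original question used the .NET SDK, the same working invocation would look roughly like this as step args (my sketch, not part of the original answer; note that "\\" in a regular C# string literal is a single backslash at runtime, and no shell quoting is needed):

step.HadoopJarStep = new HadoopJarStepConfig
{
    Jar = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    Args = new List<string>
    {
        "--src=s3://foo/spark/result/",        // trailing "/" matters, as noted above
        "--dest=s3://foo/spark/results-merged",
        "--groupBy=.*/(\\w+)/.*",              // \\ in the literal is one backslash at runtime
        "--targetSize=1024"
    }
};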

Thanks.

answered Sep 23 '22 by Barbaros Alp