I am using the AWS .NET SDK to run an s3-dist-cp job on EMR to concatenate all files in a folder with the --groupBy arg. But whatever --groupBy arg I have tried, it either fails or just copies the files without concatenating them, as if no --groupBy were specified in the arg list.
The files in the folder were written by Spark's saveAsTextFile and are named like below:
part-0000
part-0001
part-0002
...
...
step.HadoopJarStep = new HadoopJarStepConfig
{
    Jar = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    Args = new List<string>
    {
        "--s3Endpoint=s3-eu-west-1.amazonaws.com",
        "--src=s3://foo/spark/result/bar",
        "--dest=s3://foo/spark/result-merged/bar",
        "--groupBy=(part.*)",
        "--targetSize=256"
    }
};
S3DistCp copies data using distributed MapReduce jobs, similar to DistCp: the mappers compile the list of files to copy to the destination, and once that list is complete the reducers perform the actual copy.
The EMR File System (EMRFS) is an implementation of the Hadoop file system that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like data encryption.
After struggling with this the whole day, in the end I got it to work with the --groupBy arg below:
--groupBy=.*part.*(\w+)
But even if I add --targetSize=1024 to the args, s3-dist-cp produces 2.5 MB - 3 MB files.
Does anyone have any idea about it?
**UPDATE**
Here is the --groupBy clause which concatenates all the files into one file per folder:
.*/(\\w+)/.*
The trailing "/" is very important here: --src="s3://foo/spark/result/"
There are some folders in the "result" folder:
s3://foo/spark/result/foo
s3://foo/spark/result/bar
s3://foo/spark/result/lorem
s3://foo/spark/result/ipsum
and in each folder above there are hundreds of files like:
part-0000
part-0001
part-0002
.*/(\\w+)/.*
This --groupBy clause groups every file in each folder, so in the end you get one file per folder, named after the folder (see the sketch after the list below):
s3://foo/spark/result-merged/foo/foo -> File
s3://foo/spark/result-merged/bar/bar -> File
s3://foo/spark/result-merged/lorem/lorem -> File
s3://foo/spark/result-merged/ipsum/ipsum -> File
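To see why the capture group resolves to the folder name, here is a minimal sketch using .NET's Regex class purely for illustration (s3-dist-cp itself evaluates --groupBy as a Java regex, the sample keys below are hypothetical, and the doubled backslash in the clause above is presumably just string-literal escaping for a single \w):

using System;
using System.Text.RegularExpressions;

class GroupByDemo
{
    static void Main()
    {
        // The groupBy clause from above, written as a plain regex.
        var groupBy = new Regex(@".*/(\w+)/.*");

        // Hypothetical keys following the folder layout described above.
        string[] keys =
        {
            "s3://foo/spark/result/foo/part-0000",
            "s3://foo/spark/result/bar/part-0001",
            "s3://foo/spark/result/lorem/part-0002"
        };

        foreach (var key in keys)
        {
            // The greedy leading .* pushes the capture group onto the last
            // directory component (the folder name), so every part-* file
            // in a folder ends up with the same group key.
            var match = groupBy.Match(key);
            Console.WriteLine($"{key} -> group '{match.Groups[1].Value}'");
        }
    }
}

Files that share a group key are concatenated into one output file, which is why each folder collapses into a single file named after the folder.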
So, this is the final working command for me:
s3-dist-cp --src s3://foo/spark/result/ --dest s3://foo/spark/results-merged --groupBy '.*/(\\w+)/.*' --targetSize 1024
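If you are launching it through the .NET SDK as in the question, the same command maps onto the step config roughly like this (just a sketch mirroring the snippet from the question, with the final arguments swapped in):

step.HadoopJarStep = new HadoopJarStepConfig
{
    Jar = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    Args = new List<string>
    {
        "--src=s3://foo/spark/result/",   // keep the trailing slash
        "--dest=s3://foo/spark/results-merged",
        "--groupBy=.*/(\\w+)/.*",         // \\w in a C# literal is a single \w in the regex
        "--targetSize=1024"
    }
};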
Thanks.