I am using the AWS .NET SDK to run an s3-dist-cp job on EMR to concatenate all files in a folder with the --groupBy arg. But whatever --groupBy arg I have tried, it either fails or just copies the files without concatenating them, as if no --groupBy were specified in the arg list.
The files in the folder were written by Spark's saveAsTextFile and are named like below:
part-0000
part-0001
part-0002
...
...
step.HadoopJarStep = new HadoopJarStepConfig
{
    Jar = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    Args = new List<string>
    {
        "--s3Endpoint=s3-eu-west-1.amazonaws.com",
        "--src=s3://foo/spark/result/bar",
        "--dest=s3://foo/spark/result-merged/bar",
        "--groupBy=(part.*)",
        "--targetSize=256"
    }
};
S3DistCp copies data using distributed MapReduce jobs, similar to DistCp: the mappers compile the list of files to copy to the destination, and once that list is complete the reducers perform the actual copy.
The EMR File System (EMRFS) is an implementation of the Hadoop file system that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like data encryption.
After struggling with this the whole day, in the end I got it to work with the --groupBy arg below:
--groupBy=.*part.*(\w+)
But even if I add --targetSize=1024 to the args, s3-dist-cp produces 2.5 MB - 3 MB files.
Does anyone have any idea about it?
**UPDATE**
Here is the --groupBy clause which concatenates all the files into one file per folder:
.*/(\\w+)/.*
The trailing "/" is very important here: --src="s3://foo/spark/result/"
There are some folders in the "result" folder:
s3://foo/spark/result/foo
s3://foo/spark/result/bar
s3://foo/spark/result/lorem
s3://foo/spark/result/ipsum
and in each folder above there are hundreds of files like:
part-0000
part-0001
part-0002
.*/(\\w+)/.*
This --groupBy clause groups every file in each folder, so in the end you get one file per folder, named after the folder (see the sketch after the list below):
s3://foo/spark/result-merged/foo/foo -> File
s3://foo/spark/result-merged/bar/bar -> File
s3://foo/spark/result-merged/lorem/lorem -> File
s3://foo/spark/result-merged/ipsum/ipsum -> File
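To see why the capture group resolves to the folder name, here is a minimal sketch using .NET's Regex class purely for illustration (s3-dist-cp itself evaluates --groupBy as a Java regex, the sample keys below are hypothetical, and the doubled backslash in the clause above is presumably just string-literal escaping for a single \w):

using System;
using System.Text.RegularExpressions;

class GroupByDemo
{
    static void Main()
    {
        // The groupBy clause from above, written as a plain regex.
        var groupBy = new Regex(@".*/(\w+)/.*");

        // Hypothetical keys following the folder layout described above.
        string[] keys =
        {
            "s3://foo/spark/result/foo/part-0000",
            "s3://foo/spark/result/bar/part-0001",
            "s3://foo/spark/result/lorem/part-0002"
        };

        foreach (var key in keys)
        {
            // The greedy leading .* pushes the capture group onto the last
            // directory component (the folder name), so every part-* file
            // in a folder ends up with the same group key.
            var match = groupBy.Match(key);
            Console.WriteLine($"{key} -> group '{match.Groups[1].Value}'");
        }
    }
}

Files that share a group key are concatenated into one output file, which is why each folder collapses into a single file named after the folder.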
So, this is the final working command for me:
s3-dist-cp --src s3://foo/spark/result/ --dest s3://foo/spark/results-merged --groupBy '.*/(\\w+)/.*' --targetSize 1024
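If you are launching it through the .NET SDK as in the question, the same command maps onto the step config roughly like this (just a sketch mirroring the snippet from the question, with the final arguments swapped in):

step.HadoopJarStep = new HadoopJarStepConfig
{
    Jar = "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
    Args = new List<string>
    {
        "--src=s3://foo/spark/result/",   // keep the trailing slash
        "--dest=s3://foo/spark/results-merged",
        "--groupBy=.*/(\\w+)/.*",         // \\w in a C# literal is a single \w in the regex
        "--targetSize=1024"
    }
};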
Thanks.