
How do I use HDFS with EMR?

I feel that connecting EMR to Amazon S3 is highly unreliable because of the dependency on network speed.

I can only find documentation describing how to point EMR at an S3 location. I want to use EMR with HDFS - how do I do this?

asked Mar 12 '14 by user3237842

People also ask

Can we use HDFS in EMR?

HDFS and EMRFS are the two main file systems used with Amazon EMR.

Does EMR use HDFS or S3?

HDFS and the EMR File System (EMRFS), which uses Amazon S3, are both compatible with Amazon EMR, but they're not interchangeable. HDFS is an implementation of the Hadoop FileSystem API, which models POSIX file system behavior. EMRFS is backed by Amazon S3, which is an object store, not a file system.

Can HDFS be used as a storage option for an EMR cluster?

HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster.


1 Answer

You can just use hdfs input and output paths like hdfs:///input/.

Say you have a job added to a cluster as follows:

elastic-mapreduce -j $jobflow --jar s3://my-jar-location/myjar.jar --arg s3://input --arg s3://output

If you need it to read from and write to HDFS instead, you can have it as follows:

elastic-mapreduce -j $jobflow --jar s3://my-jar-location/myjar.jar --arg hdfs:///input --arg hdfs:///output
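
Note that hdfs:///input must actually contain data before the job runs. One common way to stage it (a rough sketch assuming the S3DistCp tool that ships with EMR, and a hypothetical bucket name my-bucket) is to run this on the master node:

s3-dist-cp --src s3://my-bucket/input --dest hdfs:///input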

To interact with HDFS on an EMR cluster, SSH to the master node and run standard HDFS commands. For example, to fetch an output file, you might do the following:

hadoop fs -get hdfs:///output/part-r-00000 /home/ec2-user/firstPartOutputFile
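
A few other standard commands work the same way once you are on the master node (the paths here are just placeholders):

hadoop fs -ls hdfs:///output             # list the job's output directory
hadoop fs -put localfile hdfs:///input/  # upload a local file into HDFS
hadoop fs -rm -r hdfs:///output          # remove an old output dir before re-running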

But if you are working with transient clusters, using the cluster's HDFS is discouraged, as you will lose the data when the cluster is terminated.

I also have benchmarks showing that using S3 or HDFS doesn't make much of a performance difference. For a workload of ~200 GB:

  • the job finished in 22 seconds with S3 as the input source
  • the job finished in 20 seconds with HDFS as the input source

EMR is heavily optimized for reading and writing data to and from S3.

For intermediate steps, writing output to HDFS is best. So if you have 3 steps in your pipeline, you may have input/output as follows (a sketch of the corresponding commands follows the list):

  • Step 1: Input from S3, Output in HDFS
  • Step 2: Input from HDFS, Output in HDFS
  • Step 3: Input from HDFS, Output in S3
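
As a rough sketch using the same CLI syntax as above (the jar locations, step jars, and bucket name are hypothetical), the three steps could be added like this:

elastic-mapreduce -j $jobflow --jar s3://my-jar-location/step1.jar --arg s3://my-bucket/input --arg hdfs:///intermediate1
elastic-mapreduce -j $jobflow --jar s3://my-jar-location/step2.jar --arg hdfs:///intermediate1 --arg hdfs:///intermediate2
elastic-mapreduce -j $jobflow --jar s3://my-jar-location/step3.jar --arg hdfs:///intermediate2 --arg s3://my-bucket/output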
answered Sep 25 '22 by Amar