
How do I use HDFS with EMR?

I feel that connecting EMR to Amazon S3 is highly unreliable because of the dependency on network speed.

I can only find documentation describing how to point EMR at an S3 location. I want to use EMR with HDFS - how do I do this?

asked Mar 12 '14 by user3237842

People also ask

Can we use HDFS in EMR?

HDFS and EMRFS are the two main file systems used with Amazon EMR.

Does EMR use HDFS or S3?

HDFS and the EMR File System (EMRFS), which uses Amazon S3, are both compatible with Amazon EMR, but they're not interchangeable. HDFS is an implementation of the Hadoop FileSystem API, which models POSIX file system behavior. EMRFS is backed by Amazon S3, which is an object store, not a file system.

Can HDFS be used as a storage option for an EMR cluster?

HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster.


1 Answer

You can just use hdfs input and output paths like hdfs:///input/.

Say you have a job added to a cluster as follows:

elastic-mapreduce -j $jobflow --jar s3://my-jar-location/myjar.jar --arg s3://input --arg s3://output

If you need it to read from and write to HDFS instead, you can have it as follows:

elastic-mapreduce -j $jobflow --jar s3://my-jar-location/myjar.jar --arg hdfs:///input --arg hdfs:///output
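
Note that hdfs:///input must actually contain data before the job runs. One common way to stage it (a rough sketch assuming the S3DistCp tool that ships with EMR, and a hypothetical bucket name my-bucket) is to run this on the master node:

s3-dist-cp --src s3://my-bucket/input --dest hdfs:///input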

To interact with HDFS on an EMR cluster, SSH to the master node and run standard HDFS commands. For example, to fetch an output file, you might do the following:

hadoop fs -get hdfs:///output/part-r-00000 /home/ec2-user/firstPartOutputFile
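
A few other standard commands work the same way once you are on the master node (the paths here are just placeholders):

hadoop fs -ls hdfs:///output             # list the job's output directory
hadoop fs -put localfile hdfs:///input/  # upload a local file into HDFS
hadoop fs -rm -r hdfs:///output          # remove an old output dir before re-running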

But if you are working with transient clusters, using the cluster's HDFS is discouraged, as you will lose the data when the cluster is terminated.

I also have benchmarks showing that using S3 or HDFS doesn't make much of a performance difference. For a workload of ~200 GB:

  • the job finished in 22 seconds with S3 as the input source
  • the job finished in 20 seconds with HDFS as the input source

EMR is heavily optimized for reading and writing data to and from S3.

For intermediate steps, writing output to HDFS is best. So if you have 3 steps in your pipeline, you may have input/output as follows (a sketch of the corresponding commands follows the list):

  • Step 1: Input from S3, Output in HDFS
  • Step 2: Input from HDFS, Output in HDFS
  • Step 3: Input from HDFS, Output in S3
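
As a rough sketch using the same CLI syntax as above (the jar locations, step jars, and bucket name are hypothetical), the three steps could be added like this:

elastic-mapreduce -j $jobflow --jar s3://my-jar-location/step1.jar --arg s3://my-bucket/input --arg hdfs:///intermediate1
elastic-mapreduce -j $jobflow --jar s3://my-jar-location/step2.jar --arg hdfs:///intermediate1 --arg hdfs:///intermediate2
elastic-mapreduce -j $jobflow --jar s3://my-jar-location/step3.jar --arg hdfs:///intermediate2 --arg s3://my-bucket/output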
answered Sep 25 '22 by Amar