
How do I copy files from S3 to Amazon EMR HDFS?

I'm running hive over EMR, and need to copy some files to all EMR instances.

One way, as I understand it, is to copy the files to the local file system on each node; the other is to copy the files to HDFS. However, I haven't found a simple way to copy straight from S3 to HDFS.

What is the best way to go about this?

asked Sep 20 '11 by Tomer

People also ask

Can EMR read data from S3?

Amazon EMR uses the AWS SDK for Java with Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as buckets. Buckets have certain restrictions and limitations to conform with Amazon S3 and DNS requirements.

How do I upload files to EMR?

1) To upload files, simply drag and drop any file onto the page. Alternatively, click on the Upload Files link and the Open file window will appear. Now select your chosen documents (hold Control/Command to select multiple items) and press Open.

Does Amazon EMR use HDFS?

HDFS is automatically installed with Hadoop on your Amazon EMR cluster, and you can use HDFS along with Amazon S3 to store your input and output data. You can easily encrypt HDFS using an Amazon EMR security configuration.


2 Answers

The best way to do this is to use Hadoop's distcp command. Example (on one of the cluster nodes):

% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/myfile /root/myfile

This would copy a file called myfile from an S3 bucket named mybucket to /root/myfile in HDFS. Note that this example assumes you are using the S3 file system in "native" mode; this means that Hadoop sees each object in S3 as a file. If you use S3 in block mode instead, you would replace s3n with s3 in the example above. For more info about the differences between native S3 and block mode, as well as an elaboration on the example above, see http://wiki.apache.org/hadoop/AmazonS3.
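As a quick sketch (the bucket name and paths below are placeholders, not taken from the question), copying an entire key prefix works the same way, and the only change for block mode is the URI scheme:

% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/input/ hdfs:///user/hadoop/input/   # native mode
% ${HADOOP_HOME}/bin/hadoop distcp s3://mybucket/input/ hdfs:///user/hadoop/input/    # block mode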

I found that distcp is a very powerful tool. In addition to being able to use it to copy a large number of files in and out of S3, you can also perform fast cluster-to-cluster copies with large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster when transferring large amounts of data, compared to the alternative of copying everything to the local file system as an intermediary.
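For illustration, a cluster-to-cluster copy might look like the following; the namenode hostnames and paths are hypothetical:

% ${HADOOP_HOME}/bin/hadoop distcp hdfs://source-namenode:8020/data/logs hdfs://dest-namenode:8020/data/logs

distcp runs as a MapReduce job, and each map task copies a subset of the files, which is what spreads the transfer across the cluster.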

answered Oct 08 '22 by Patrick Salami


Now Amazon itself has a wrapper implemented over distcp, namely: s3distcp.

S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a job flow. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.

Example: Copy log files from Amazon S3 to HDFS

The following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.

elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
  s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
  --dest,hdfs:///output,\
  --srcPattern,.*daemons.*-hadoop-.*'
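The reverse direction works the same way. Here is a sketch (the destination s3://myawsbucket/results/ is a placeholder bucket, not from the documentation example) that copies the HDFS output back to S3 as another job flow step:

elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
  s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
  --args '--src,hdfs:///output,--dest,s3://myawsbucket/results/'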
answered Oct 08 '22 by Amar