Copying files from Amazon S3 to HDFS using S3DistCp fails

I am trying to copy files from S3 to HDFS using a job flow in EMR. When I run the command below, the job flow starts successfully but fails with an error when it tries to copy the file to HDFS. Do I need to set any input file permissions?

Command:

./elastic-mapreduce --jobflow j-35D6JOYEDCELA --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://odsh/input/,--dest,hdfs:///Users

Output:

Task TASKID="task_201301310606_0001_r_000000" TASK_TYPE="REDUCE" TASK_STATUS="FAILED" FINISH_TIME="1359612576612" ERROR="java.lang.RuntimeException: Reducer task failed to copy 1 files: s3://odsh/input/GL_01112_20121019.dat etc
    at com.amazon.external.elasticmapreduce.s3distcp.CopyFilesReducer.close(CopyFilesReducer.java:70)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:538)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:429)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

asked Jan 31 '13 17:01

raghuram gururajan

1 Answer

I'm getting the same exception. It looks like the bug is caused by a race condition when CopyFilesReducer uses multiple CopyFilesRunable instances to download the files from S3. The problem is that all of the threads use the same temp directory, and each thread deletes that directory when it is done. Hence, when one thread completes before another, it deletes the temp directory that the other thread is still using.

I've reported the problem to AWS, but in the meantime you can work around the bug by forcing the reducer to use a single thread: set the variable s3DistCp.copyfiles.mapper.numWorkers to 1 in your job config.
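
To make the failure mode concrete, here is a minimal sketch of the race and of why a single worker avoids it. It is only a model of the behavior described above, not S3DistCp's actual code: the class, method, and file names are all hypothetical, and the sleeps merely stand in for download and copy time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative model only: every name here is hypothetical, not taken from S3DistCp.
final class SharedTempDirDemo {

    // Stages one file in the shared temp dir, reads it back after workMillis
    // (standing in for the copy to HDFS), then deletes the whole temp dir.
    static Callable<Boolean> copyTask(Path sharedTemp, String name, long workMillis) {
        return () -> {
            try {
                Files.createDirectories(sharedTemp);
                Path staged = sharedTemp.resolve(name);
                Files.write(staged, "payload".getBytes());
                Thread.sleep(workMillis);
                Files.readAllBytes(staged); // fails if another worker removed the dir
                return true;
            } catch (IOException e) {
                return false;
            } finally {
                deleteRecursively(sharedTemp); // every worker cleans up the shared dir
            }
        };
    }

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so children are deleted before their parent directory.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.deleteIfExists(p);
        }
    }

    // Runs a slow and a fast copy on a pool of numWorkers threads and
    // returns how many of the two copies failed.
    static int countFailures(int numWorkers) throws Exception {
        Path root = Files.createTempDirectory("demo");
        Path sharedTemp = root.resolve("shared-temp");
        ExecutorService pool = Executors.newFixedThreadPool(numWorkers);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            results.add(pool.submit(copyTask(sharedTemp, "slow.dat", 300)));
            results.add(pool.submit(copyTask(sharedTemp, "fast.dat", 100)));
            int failures = 0;
            for (Future<Boolean> result : results) {
                if (!result.get()) {
                    failures++;
                }
            }
            return failures;
        } finally {
            pool.shutdown();
            deleteRecursively(root);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("failures with 2 workers: " + countFailures(2));
        System.out.println("failures with 1 worker:  " + countFailures(1));
    }
}
```

With two workers, the fast task finishes first and deletes the directory that still holds the slow task's staged file, so the slow task's read fails. With one worker the tasks run back to back, and each one creates and removes the directory without interference, which is the effect the numWorkers setting is meant to achieve.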

answered Oct 20 '22 01:10

user1995521

Tags: amazon-s3, hadoop, hdfs, elastic-map-reduce