 

How to overwrite/reuse the existing output path for Hadoop jobs again and again

I want to overwrite/reuse the existing output directory when I run my Hadoop job daily. The output directory stores the summarized results of each day's run. If I specify the same output directory, the job fails with the error "output directory already exists".

How to bypass this validation?

asked Oct 10 '11 by yogesh

People also ask

When running MapReduce program if you provide the output directory path which already exists then what will happen?

What will happen if the output directory already exists for a MapReduce job? The job will fail with a FileAlreadyExistsException: Hadoop refuses to write into an existing output directory so that previous results are never silently overwritten.

Can we have Hadoop job output in multiple directories?

Yes, it is possible to have the output of Hadoop MapReduce Job written to multiple directories.

What is a mapper and reducer in Hadoop?

The mapper processes the input data and produces intermediate key/value pairs. The reduce stage combines the shuffle step with the reduce step: the reducer processes the data that comes from the mappers and, after processing, produces a new set of output, which is stored in HDFS.
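The map and reduce stages described above can be sketched in plain Java, with word counting as the example. This is a conceptual illustration only, not the Hadoop API; the class and method names are made up for the sketch:

```java
import java.util.*;

public class WordCountSketch {
    // Map stage: turn each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce stage: group the pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String line : new String[] {"hadoop stores output", "hadoop output"}) {
            shuffled.addAll(map(line));
        }
        System.out.println(reduce(shuffled)); // {hadoop=2, output=2, stores=1}
    }
}
```

In a real job each mapper and reducer runs on a separate node, and the framework performs the shuffle between them; the data flow is the same.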


2 Answers

What about deleting the directory before you run the job?

You can do this via shell:

hadoop fs -rmr /path/to/your/output/

(on Hadoop 2 and later, -rmr is deprecated in favor of: hadoop fs -rm -r /path/to/your/output/)

or via the Java API:

// the Configuration should contain a reference to your namenode
FileSystem fs = FileSystem.get(new Configuration());
// the second argument "true" deletes the folder recursively
fs.delete(new Path("/path/to/your/output"), true);
answered Nov 06 '22 by Thomas Jungblut


Jungblut's answer is your direct solution. Since I never trust automated processes to delete stuff (me personally), I'll suggest an alternative:

Instead of trying to overwrite, I suggest you make the output name of your job dynamic, including the time in which it ran.

Something like "/path/to/your/output-2011-10-09-23-04/". This way you can keep your old job output around in case you ever need to revisit it. In my system, which runs 10+ daily jobs, we structure the output as: /output/job1/2011/10/09/job1out/part-r-xxxxx, /output/job1/2011/10/10/job1out/part-r-xxxxx, etc.
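Building such a date-stamped directory name can be sketched in plain Java. The /output base, job1 name, and yyyy/MM/dd layout are assumptions matching the structure above; in a real driver you would pass the resulting string to the job's output-path setter (FileOutputFormat.setOutputPath in the Hadoop API):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class OutputPath {
    // Build a per-run output directory like /output/job1/2011/10/09/job1out
    // so each day's run writes to a fresh path and never collides.
    static String datedOutputDir(String base, String jobName, Date runDate) {
        String datePart = new SimpleDateFormat("yyyy/MM/dd").format(runDate);
        return base + "/" + jobName + "/" + datePart + "/" + jobName + "out";
    }

    public static void main(String[] args) {
        // In a real driver: FileOutputFormat.setOutputPath(job, new Path(dir));
        String dir = datedOutputDir("/output", "job1", new Date());
        System.out.println(dir);
    }
}
```

Because every run gets its own directory, the "output directory already exists" check never fires, and old results stay available for reprocessing or auditing.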

answered Nov 06 '22 by Donald Miner