I want to overwrite/reuse the existing output directory when I run my Hadoop job daily. The output directory stores the summarized results of each day's job run. If I specify the same output directory, the job fails with the error "output directory already exists".
How can I bypass this validation?
What about deleting the directory before you run the job?
You can do this via shell:
hadoop fs -rmr /path/to/your/output/
(on newer Hadoop versions, the equivalent is hadoop fs -rm -r /path/to/your/output/)
or via the Java API:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// the Configuration should contain a reference to your namenode
FileSystem fs = FileSystem.get(new Configuration());
// "true" deletes the given folder recursively
fs.delete(new Path("/path/to/your/output"), true);
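For context, here is a minimal sketch of how that delete call might be wired into a job driver so the output directory is cleared before each daily run. The class name DailySummaryDriver, the output path, and the job name are illustrative assumptions, not part of the answer above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: clears the last run's output, then submits the job.
public class DailySummaryDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path outputPath = new Path("/path/to/your/output");

        // Delete the previous run's output recursively, if present,
        // so the output-directory existence check passes.
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }

        Job job = Job.getInstance(conf, "daily-summary");
        job.setJarByClass(DailySummaryDriver.class);
        // ... set mapper, reducer, input paths, and key/value classes here ...
        FileOutputFormat.setOutputPath(job, outputPath);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}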
Jungblut's answer is your direct solution. Since I personally never trust automated processes to delete data, I'll suggest an alternative:
Instead of trying to overwrite, I suggest you make your job's output directory name dynamic, including the time at which it ran, something like "/path/to/your/output-2011-10-09-23-04/". This way you can keep your old job output around in case you ever need to revisit it. In my system, which runs 10+ daily jobs, we structure the output as /output/job1/2011/10/09/job1out/part-r-xxxxx, /output/job1/2011/10/10/job1out/part-r-xxxxx, and so on.
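As a minimal sketch of this naming scheme (the helper class name, base directory, and job name here are my own illustrative assumptions):

import java.text.SimpleDateFormat;
import java.util.Date;

public class DatedOutputPath {
    // Builds a path like /output/job1/2011/10/09/job1out for today's date.
    public static String forToday(String base, String jobName) {
        SimpleDateFormat dateDirs = new SimpleDateFormat("yyyy/MM/dd");
        return base + "/" + jobName + "/" + dateDirs.format(new Date()) + "/" + jobName + "out";
    }

    public static void main(String[] args) {
        System.out.println(forToday("/output", "job1")); // e.g. /output/job1/2011/10/10/job1out
    }
}

You would then pass the generated string to FileOutputFormat.setOutputPath(job, new Path(...)) in your driver, so each daily run writes to its own dated directory and never collides with a previous run.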