Hadoop MapReduce multiple input files

So I need two files as input to my MapReduce program: City.dat and Country.dat.

In my main method I'm parsing the command-line arguments like this:

Path cityInputPath = new Path(args[0]);
Path countryInputPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);

If I now run my program with the following command:

hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat /user/cloudera/capital/input/Country.dat /user/cloudera/capital/output

I get the following error:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/cloudera/capital/input/Country.dat already exists

Why does it treat this as my output directory? I specified another directory as the output directory. Can somebody explain this?

asked Nov 05 '12 by gaussd

1 Answer

Based on the stack trace, your output directory (as Hadoop resolves it) already exists, so the simplest thing is to delete it before running the job:

bin/hadoop fs -rmr /user/cloudera/capital/output
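If you prefer to do that from the driver itself, something like this works as well (just a sketch, assuming conf is the job's Configuration and outputPath is the Path from your code; it uses org.apache.hadoop.fs.FileSystem):

FileSystem fs = FileSystem.get(conf);
if (fs.exists(outputPath)) {
    // true = recursive, so the whole output directory is removed
    fs.delete(outputPath, true);
}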

Besides that, your arguments start with the class name of your main class, org.myorg.Capital, so that is the argument at index zero (based on the stack trace and the code you have provided). That usually happens when the jar's manifest already declares a main class: hadoop jar then passes org.myorg.Capital through as an ordinary argument instead of treating it as the class to run.

Basically you need to shift all your indices one to the right:

Path cityInputPath = new Path(args[1]);
Path countryInputPath = new Path(args[2]);
Path outputPath = new Path(args[3]);
MultipleInputs.addInputPath(job, countryInputPath, TextInputFormat.class, JoinCountryMapper.class);
MultipleInputs.addInputPath(job, cityInputPath, TextInputFormat.class, JoinCityMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
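If you want to see this for yourself, a quick debugging sketch (not part of the fix) is to print the raw arguments at the top of your main method; with the command above you should see org.myorg.Capital at index 0:

for (int i = 0; i < args.length; i++) {
    System.out.println("args[" + i + "] = " + args[i]);
}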

Don't forget to clear your output folder though!

A small tip for you: you can separate the files with a comma "," and set them with a single call like this:

hadoop jar capital.jar org.myorg.Capital /user/cloudera/capital/input/City.dat,/user/cloudera/capital/input/Country.dat

And in your Java code:

FileInputFormat.addInputPaths(job, args[1]);
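Note that this single-call variant sends both files through whatever single mapper class you set on the job, so it only fits if one mapper can handle both record formats; with MultipleInputs you keep a dedicated mapper per input. A rough sketch of that variant (JoinMapper is a hypothetical combined mapper, and the output directory is assumed to follow the inputs on the command line):

// FileInputFormat here is org.apache.hadoop.mapreduce.lib.input.FileInputFormat
FileInputFormat.addInputPaths(job, args[1]);              // comma-separated input paths
FileOutputFormat.setOutputPath(job, new Path(args[2]));   // assumption: output dir as the next argument
job.setMapperClass(JoinMapper.class);                     // hypothetical combined mapper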
answered Sep 27 '22 by Thomas Jungblut