I've written a mapreduce program in Java, which I can submit to a remote cluster running in distributed mode. Currently, I submit the job using the following steps:
myMRjob.jar
)hadoop jar myMRjob.jar
I would like to submit the job directly from Eclipse when I try to run the program. How can I do this?
I am currently using CDH3, and an abridged version of my conf is:
conf.set("hbase.zookeeper.quorum", getZookeeperServers());
conf.set("fs.default.name","hdfs://namenode/");
conf.set("mapred.job.tracker", "jobtracker:jtPort");
Job job = new Job(conf, "COUNT ROWS");
job.setJarByClass(CountRows.class);
// Set up Mapper
TableMapReduceUtil.initTableMapperJob(inputTable, scan,
CountRows.MyMapper.class, ImmutableBytesWritable.class,
ImmutableBytesWritable.class, job);
// Set up Reducer
job.setReducerClass(CountRows.MyReducer.class);
job.setNumReduceTasks(16);
// Setup Overall Output
job.setOutputFormatClass(MultiTableOutputFormat.class);
job.submit();
When I run this directly from Eclipse, the job is launched but Hadoop cannot find the mappers/reducers. I get the following errors:
12/06/27 23:23:29 INFO mapred.JobClient: map 0% reduce 0%
12/06/27 23:23:37 INFO mapred.JobClient: Task Id : attempt_201206152147_0645_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: com.mypkg.mapreduce.CountRows$MyMapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996)
at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:212)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:602)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
...
Does anyone know how to get past these errors? If I can fix this, I can integrate more MR jobs into my scripts which would be awesome!
If you're submitting the hadoop job from within the Eclipse project that defines the classes for the job then you most probably have a classpath problem.
The job.setjarByClass(CountRows.class)
call is finding the class file on the build classpath, and not in the CountRows.jar (which may or may not have been built yet, or even on the classpath).
You should be able to assert this is true by printing out the result of job.getJar()
after you call job.setjarByClass(..)
, and if it doesn't display a jar filepath, then it's found the build class, rather than the jar'd class
What worked for me was exporting a runnable JAR (the difference between it and a JAR is that the first defines the class, which has the main method) and selecting the "packaging required libraries into JAR" option (choosing the "extracting..." option leads to duplicate errors and it also has to extract the class files from the jars, which, ultimately, in my case, resulted in not resolving the class not found exception).
After that, you can just set the jar, as was suggested by Chris White. For Windows it would look like this: job.setJar("C:\\\MyJar.jar");
If it helps anybody, I made a presentation on what I learned from creating a MapReduce project and running it in Hadoop 2.2.0 in Windows 7 (in Eclipse Luna)
I have used this method from the following website to configure a Map/Reduce project of mine to run the project using Eclipse (w/o exporting project as JAR) Configuring Eclipse to run Hadoop Map/Reduce project
Note: If you decide to debug you program, your Mapper
class and Reducer
class won't be debug-able.
Hope it helps. :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With