I'm retrofitting a bunch of existing Hadoop unit tests, which previously ran in an in-memory cluster (using MiniMRCluster), to use MRUnit. The existing test cases essentially provide input to the Map phase and then test the output from the Reduce phase.
I have three questions, and the best answer to any of them will qualify:
1) What do I lose, architecturally, by unit testing with MRUnit instead of an in-memory cluster?
2) Is it worthwhile to break the existing test cases up into Map-only tests and Reduce-only tests or not? Are there any cases where I would have to break them up?
3) Are there any testing scenarios that MRUnit is unable to cover?
According to the Cloudera website, “MRUnit helps bridge the gap between MapReduce programs and JUnit by providing a set of interfaces and test harnesses, which allow MapReduce programs to be more easily tested using standard tools and practices.”
JUnit can be used for both unit and integration testing, and it also supports Java 8 features.
The sort phase in MapReduce covers the merging and sorting of map outputs. Data from the mappers is grouped by key, split among the reducers, and sorted by key, so every reducer receives all the values associated with a given key. In Hadoop, the shuffle and sort phases occur simultaneously and are handled by the MapReduce framework.
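To make the grouping, splitting, and sorting concrete, here is a minimal plain-Java sketch of what happens to map output before it reaches the reducers. It does not use any Hadoop classes: `ShuffleSortSketch`, `shuffle`, `pair`, and `sampleMapOutput` are names made up for illustration, and `String` keys stand in for Hadoop's writable key types. The `partitionFor` formula mirrors the one used by Hadoop's default `HashPartitioner`, but the real framework does far more (spilling, merging, copying across nodes) than this shows.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of shuffle and sort; no Hadoop classes are involved.
final class ShuffleSortSketch {

    // Same formula as Hadoop's default HashPartitioner, applied here to String keys
    // (an assumption: real jobs hash Writable keys such as Text, which hash differently).
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Hypothetical helper that builds one (key, value) pair of map output.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Hypothetical sample of unsorted map output, as a word-count mapper might emit it.
    static List<Map.Entry<String, Integer>> sampleMapOutput() {
        return Arrays.asList(pair("b", 1), pair("a", 1), pair("c", 1), pair("b", 1));
    }

    // Routes each key to exactly one reducer, groups its values, and keeps keys sorted.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>()); // TreeMap iterates its keys in sorted order
        }
        for (Map.Entry<String, Integer> p : mapOutput) {
            reducers.get(partitionFor(p.getKey(), numReducers))
                    .computeIfAbsent(p.getKey(), k -> new ArrayList<>())
                    .add(p.getValue());
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<TreeMap<String, List<Integer>>> reducers = shuffle(sampleMapOutput(), 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + " input: " + reducers.get(i));
        }
    }
}
```

Each reducer ends up with a sorted view of its own keys, and no key is ever split across two reducers, which is the guarantee a reduce function relies on.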
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
The retrofitting process has taught me some potential answers, which I'm going to post here. I would still prefer to hear what others have to say, though, so I won't accept this answer.
1) I lose at least two things. First, the MR plumbing is mocked, so there is a chance that the mocking hides a problem that exists in the real MR job. Second, an MR job consists of input from the file system and output to the file system, in addition to partitioning and ordering between the map and reduce phases. MRUnit doesn't completely handle these aspects of Hadoop, so if an MR job depends on them, they can't be tested. It is still possible to rewrite the tests to cover just the Map/Reduce parts, though.
2) For the most part, it isn't worthwhile to break up existing tests. If an existing test depends on a partitioner, for example, then it may make sense to break up the test so that the Map and Reduce can be tested without the partitioner involved. In general, though, it isn't worth doing "just to do it."
3) Yes -- Partitioners for one. Output formats for another. This may not be quite as big a deal for some people, but many of our existing jobs rely on these two features, and since the unit tests run against the final output from the output format, I'm having to rewrite quite a few tests to get them to work.
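One possible workaround for the partitioner gap, offered as a rough sketch rather than anything MRUnit provides: the partitioning rule is usually a pure function of the key and the number of partitions, so it can be checked directly without any driver or cluster. The example below is plain Java with no Hadoop or MRUnit classes. `PartitionerSketch` and its first-letter routing rule are hypothetical, and the method only loosely mirrors Hadoop's `getPartition(key, value, numPartitions)` by dropping the value argument and using a `String` key.

```java
// Plain-Java stand-in for a custom partitioner; no Hadoop or MRUnit classes are used.
final class PartitionerSketch {

    // Hypothetical rule: keys that share a first letter (case-insensitive)
    // always land in the same partition. Empty keys go to partition 0.
    static int getPartition(String key, int numPartitions) {
        if (key.isEmpty()) {
            return 0;
        }
        // A char is never negative, so the remainder is always in [0, numPartitions).
        return Character.toLowerCase(key.charAt(0)) % numPartitions;
    }

    // Checks the two properties a reducer depends on: a valid range and stable routing.
    static void check(String key, int numPartitions) {
        int first = getPartition(key, numPartitions);
        if (first < 0 || first >= numPartitions) {
            throw new AssertionError("partition out of range for key: " + key);
        }
        if (first != getPartition(key, numPartitions)) {
            throw new AssertionError("partition not deterministic for key: " + key);
        }
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "Avocado", "banana", ""};
        for (String key : keys) {
            check(key, 4);
            System.out.println("'" + key + "' -> partition " + getPartition(key, 4));
        }
    }
}
```

In a real job the same assertions could be pointed at the actual partitioner class from an ordinary JUnit test. That covers the routing rule itself, but not how the framework applies it between the map and reduce phases.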
[edit]
I just read a blog post from Cloudera that speaks to the answer as well:
http://www.cloudera.com/blog/2009/07/debugging-mapreduce-programs-with-mrunit/