
Get duplicate records in a large file using MapReduce

I have a large file containing more than 10 million lines. I want to find the duplicate lines using MapReduce. How can I solve this problem? Thanks for your help.

Vi Ngo Van asked May 10 '26 19:05

1 Answer

You need to make use of the fact that the default behaviour of MapReduce is to group values based on a common key.

So the basic steps required are:

  1. Read each line of your file into your mapper, probably using something like the TextInputFormat.
  2. Set the output key (a Text object) to the value of each line. The contents of the value don't really matter; you can just set it to a NullWritable if you want.
  3. In the reducer, check the number of values grouped for each key. If there is more than one value, you know you have a duplicate.
  4. If you just want the duplicate values, write out only the keys that have multiple values.
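The steps above target Hadoop's Java API, but the logic is easy to see in miniature. Here is a minimal Python simulation of the map, shuffle, and reduce phases (not actual Hadoop code; `None` stands in for NullWritable, and the `shuffle` function plays the role of the framework's group-by-key step):

```python
from collections import defaultdict

def mapper(line):
    # Step 2: emit the whole line as the key; the value is irrelevant
    # (None stands in for Hadoop's NullWritable).
    yield (line, None)

def shuffle(mapped_pairs):
    # The MapReduce framework groups all values by key between the
    # map and reduce phases; this mimics that behaviour.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Steps 3-4: a key that collected more than one value is a duplicate line.
    if len(values) > 1:
        yield key

def find_duplicates(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    groups = shuffle(mapped)
    duplicates = []
    for key, values in groups.items():
        duplicates.extend(reducer(key, values))
    return duplicates

lines = ["apple", "banana", "apple", "cherry", "banana", "apple"]
print(find_duplicates(lines))  # ['apple', 'banana']
```

In real Hadoop the shuffle is done for you, so only the mapper and reducer need to be written; the simulation just makes the grouping step visible.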
Binary Nerd answered May 14 '26 02:05