
Get duplicate records in a large file using MapReduce

I have a large file containing more than 10 million lines. I want to find the duplicate lines using MapReduce. How can I solve this problem? Thanks for your help.

Vi Ngo Van asked May 10 '26 19:05

1 Answer

You need to make use of the fact that the default behaviour of MapReduce is to group values based on a common key.

So the basic steps required are:

  1. Read each line of your file into your mapper, probably using something like the TextInputFormat.
  2. Set the output key (a Text object) to the value of each line. The contents of the value don't really matter; you can just set it to a NullWritable if you want.
  3. In the reducer, check the number of values grouped for each key. If there is more than one value, you know you have a duplicate.
  4. If you just want the duplicate values, write out only the keys that have multiple values.
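The steps above target Hadoop's Java API, but the logic is easy to see in miniature. Here is a minimal Python simulation of the map, shuffle, and reduce phases (not actual Hadoop code; `None` stands in for NullWritable, and the `shuffle` function plays the role of the framework's group-by-key step):

```python
from collections import defaultdict

def mapper(line):
    # Step 2: emit the whole line as the key; the value is irrelevant
    # (None stands in for Hadoop's NullWritable).
    yield (line, None)

def shuffle(mapped_pairs):
    # The MapReduce framework groups all values by key between the
    # map and reduce phases; this mimics that behaviour.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Steps 3-4: a key that collected more than one value is a duplicate line.
    if len(values) > 1:
        yield key

def find_duplicates(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    groups = shuffle(mapped)
    duplicates = []
    for key, values in groups.items():
        duplicates.extend(reducer(key, values))
    return duplicates

lines = ["apple", "banana", "apple", "cherry", "banana", "apple"]
print(find_duplicates(lines))  # ['apple', 'banana']
```

In real Hadoop the shuffle is done for you, so only the mapper and reducer need to be written; the simulation just makes the grouping step visible.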
Binary Nerd answered May 14 '26 02:05