I have about 2 million records which have about 4 string fields each which needs to be checked for duplicates. To be more specific I have name, phone, address and fathername as fields and I must check for dedupe using all these fields with rest of data. The resulting unique records need to be noted into db.
I have been able to implement mapreduce, iterarate of all records. Task rate is set to 100/s and bucket-size to 100. Billing enabled.
Currently, everything is working, but performance is very very slow. I have been able to complete only 1000 records dedupe processing among a test dataset of 10,000 records in 6 hours.
The current design in java is:
I am ready to increase any amount of GAE resources to achieve this in shortest time.
My Questions are:
Freelancers most welcome to assist in this.
Thanks for your help.
You should take advantage of the Reducer to do the equivalent of a sort -u for each field. You'll need to do one M/R job per field. You would make the field you are comparing the key in the mapper, then in the reducer you'd get all of the records with the same name grouped together and you could mark them. The second pass would be for the phone, etc. Depending on your cluster size each pass should be very fast.
Edit: @Olaf pointed out the OP probably wants totally unique records. Using a multipart key this could be a one-line hadoop streaming command to get the unique set. I'll add that soon.
Edit2: Promised streaming command that will perform a sort -u on the entire file. This assumes you have a file with the records with each field (name, fathername, phone number and address) one per line tab delimited in one or more files in the dir hdfs://example/dedup/input/. The actual hdfs path can be anything, or you could use a single file. The output will be multiple part-* files in hdfs://example/dedup/output/. You also might need to change the command as your hadoop-streaming.jar might be in a slightly different place. If you have more than 4 fields change the value of stream.num.map.output.key.fields.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input hdfs://example/dedup/input/ -output hdfs://example/dedup/output/ \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /usr/bin/uniq \
-D stream.num.map.output.key.fields=4
To retrieve the unique results to a file in the local filesystem file run this:
$HADOOP_HOME/bin/hadoop fs -cat \
'hdfs://example/dedup/output/part-*' > results.txt
One note is that as every column is a key streaming will add a null value so each row will have an extra tab at the end. That is easily stripped off.
If you want to do more than just get the uniq output you could put your own java class or command line program rather than use /usr/bin/uniq. That class could, for example, update all records that you find are duplicated by adding a fifth column in your input that is the records DB ID. Hadoop by default partitions results by the whole key so each group of duplicate records will be streamed together a reducer, and this will all happen in parallel. Please take a look at the streaming documentation for more info.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With