Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data Deduplication algorithm for large number of contacts

I'm developing an application which must be able to find & merge duplicates in a Hundreds of thousands of contact information stored in sql server DB. I have to compare all the columns in the table, each column has a weight value. The comparison must work based on the weight value. Based on the comparison result & degree of equivalence i have to decide to merge the contacts automatically or request user attention. I know that there are number of fuzzy logic algorithms for deduplication.

Read about N-gram or Q-gram-based Algorithms in http://www.melissadata.com/. Is this algorithm feasible for a large set of data? If not can any one guide me with some algorithm or tel me where to start with?

An example of what i want to achieve,

Gonzales = Gonzalez (two different spelling of different name)
Smith = Smyth (Phonetic sound the same)
123 Main st = 123 Main street (abbrevation)
Bob Smith = Robert Smith (synonym)
like image 415
Shankar Avatar asked Oct 04 '13 11:10

Shankar


People also ask

Which process helps to de duplicate files from a larger set?

Data deduplication -- often called intelligent compression or single-instance storage -- is a process that eliminates redundant copies of data and reduces storage overhead.

What is the hash algorithm for deduplication?

Generally, the data deduplication technique identifies and eliminates duplicated data blocks with a cryptographic hash function. Hash-based data deduplication methods use a hashing algorithm to distinguish “chunks” of data individually. The frequently used algorithms are SHA-1 and MD5.

What is data deduplication and why is it important when a corporation has a large amount of data?

It allows for the storage of one unique instance of all the data within a database, without any copies needlessly taking up space. Once the redundant copies of data are removed, data deduplication gives you the option to compress the single copies of data that are stored to save even more space.


3 Answers

This whole area of research is generally known as record linkage (ironically, it has about a dozen duplicate names). There are quite a few tools out there that will let you configure matching for your specific data, churn through the data, and output the duplicates for you. Some tools will even create a matching for you, if you answer some questions about the correctness of potential matches.

Q/N-gram comparison (and indexing) is one possible way to solve this, but there is a whole raft of others. You'll quickly find that different types of comparators work well for different types of data. I haven't tried Q-gram indexing myself, but researchers in the field have described it to me as relatively slow.

As for comparison with phonetic key functions (like Soundex or Metaphone or whatever) this is only suitable when you have small text fields, like separate fields for given name, family name, middle name etc. Even then these functions tend to be rather coarse. And beware of Soundex. It not only is extremely coarse, turning very different names into the same key, but also misses a number of similar names that should be treated the same. Metaphone is much better.

The Wikipedia page for record linkage used to have a list of tools, but the editors removed it. I wrote an open source tool called Duke to solve this kind of thing. It has a genetic algorithm that can help you create a configuration, or you can write one manually. Other tools exist, too.

I would advise you to use one of the tools that already exist rather than attempting to solve this from scratch.

like image 172
larsga Avatar answered Oct 12 '22 19:10

larsga


I believe the best option for deduping names is by means of a phonetic encoder. A phonetic encoder will be able to dedup alternative spellings of the same name, here is an example with some common names:

Group: Kathryn names: [Kathryn, Katharine, Katherin, Katherynn, Kathrynn, Katherynne, Kathrynne, Catherine, Cathryn, Catharine, Catherin, Catherynn, Cathrynn, Catherynne, Cathrynne]
Group: Assaf names: [Assaf, Asaf]
Group: Megan names: [Megan, Meagan, Meghan, Meaghan]
Group: Allison names: [Allison, Alyson, Allyson, Alison, Allisyn]
==============================================================
Phonetic Encoder: Caverphone2
---- Names Group: Kathryn ----
Encoded names: {KTRN111111=16}
---- Names Group: Assaf ----
Encoded names: {ASF1111111=3}
---- Names Group: Megan ----
Encoded names: {MKN1111111=5}
---- Names Group: Allison ----
Encoded names: {ALSN111111=6}
==============================================================
Phonetic Encoder: DoubleMetaphone
---- Names Group: Kathryn ----
Encoded names: {K0RN=16}
---- Names Group: Assaf ----
Encoded names: {ASF=3}
---- Names Group: Megan ----
Encoded names: {MKN=5}
---- Names Group: Allison ----
Encoded names: {ALSN=6}
==============================================================
Phonetic Encoder: Nysiis
---- Names Group: Kathryn ----
Encoded names: {CATRYN=7, CATARA=6, CATARY=5}
---- Names Group: Assaf ----
Encoded names: {ASAF=3}
---- Names Group: Megan ----
Encoded names: {MAGAN=5}
---- Names Group: Allison ----
Encoded names: {ALASAN=3, ALYSAN=3, ALASYN=2}
==============================================================
Phonetic Encoder: Soundex
---- Names Group: Kathryn ----
Encoded names: {K365=8, C365=9}
---- Names Group: Assaf ----
Encoded names: {A210=3}
---- Names Group: Megan ----
Encoded names: {M250=5}
---- Names Group: Allison ----
Encoded names: {A425=6}
==============================================================
Phonetic Encoder: RefinedSoundex
---- Names Group: Kathryn ----
Encoded names: {C30609080=5, K3060908=5, K30609080=4, C3060908=5}
---- Names Group: Assaf ----
Encoded names: {A0302=3}
---- Names Group: Megan ----
Encoded names: {M80408=5}
---- Names Group: Allison ----
Encoded names: {A070308=6}
==============================================================

In the example you can see that for Caverphone and DoubleMetaphone all names were encoded to the same string. you should see what makes sense for your data, the encoder to use depends on the language and etymology of the names (i.e. english names, german names...)

like image 45
Asaf Avatar answered Oct 12 '22 19:10

Asaf


Found a partial solution using simhash algorithm. Found a good example here http://simhash.codeplex.com/

like image 24
Shankar Avatar answered Oct 12 '22 17:10

Shankar