How to recognize similar words with differences in spelling

I want to filter out duplicate customer names from a database. A single customer may have more than one entry in the system, with the same name but slight differences in spelling. For example, a customer named Brook may have three entries in the system with these variations:

  1. Brook Berta
  2. Bruck Berta
  3. Biruk Berta

Let's assume the full name is stored in one database column. I would like to know the different mechanisms for identifying such duplicates in, say, 100,000 records. We could use regular expressions in C# to iterate through all the records, use some other pattern-matching technique, or export the records to whatever tool best fits such queries (for example, SQL with regular expression capabilities).

This is what I had in mind as a solution:

  • Write C# code to iterate through each record
  • Keep only the consonants, in order (in the above case: BrKBrt)
  • Search the other records for the same consonant pattern, treating similar-sounding letters such as (C, K), (C, S) and (F, PH) as equivalent
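
The steps above can be sketched roughly as follows. This is only an illustration in Python rather than C#, and the names `consonant_key` and `group_by_key` are made up for the example. It assumes that vowels are just a, e, i, o, u, that "ph" is treated as "f" and "c" as "k" (the (C, S) pairing is left out because it depends on context), and that repeated consonants inside a word are collapsed.

```python
from collections import defaultdict

VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def consonant_key(name: str) -> str:
    """Build a consonant-only key for a name (hypothetical helper)."""
    parts = []
    for word in name.lower().split():
        # Map similar-sounding spellings onto one letter (assumed rules).
        word = word.replace("ph", "f").replace("c", "k")
        key = []
        for ch in word:
            if not ch.isalpha() or ch in VOWELS:
                continue  # drop vowels, digits and punctuation
            if key and key[-1] == ch:
                continue  # collapse doubled consonants such as "kk"
            key.append(ch)
        parts.append("".join(key))
    return "".join(parts)


def group_by_key(names):
    """Return only the keys shared by more than one name."""
    groups = defaultdict(list)
    for name in names:
        groups[consonant_key(name)].append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}


if __name__ == "__main__":
    records = ["Brook Berta", "Bruck Berta", "Biruk Berta", "Ray Burns"]
    print(group_by_key(records))
```

Grouping by a precomputed key takes a single pass over the records, so nothing has to be compared pairwise even with 100,000 rows. The trade-off is that two names match only when their keys are exactly equal.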

Any ideas would be appreciated.

Elias Haileselassie asked Jun 22 '10 07:06



2 Answers

The Double Metaphone algorithm, published in 2000, is a newer and improved successor to the Soundex algorithm, which was patented in 1918.

The article has links to Double Metaphone implementations in many languages.

Ray Burns answered Nov 14 '22 23:11


Have a look at Soundex.

There is a Soundex function in Transact-SQL (see http://msdn.microsoft.com/en-us/library/ms187384.aspx):

SELECT
    SOUNDEX('brook berta'),
    SOUNDEX('Bruck Berta'),
    SOUNDEX('Biruk Berta');

This returns the same value, B620, for each of the three example names.
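
To see where B620 comes from, here is a minimal sketch of the classic Soundex rules in Python. It is not the Transact-SQL implementation, and the function names are made up for the example. One assumption to flag: `soundex_first_word` encodes only the first word of the name, which is chosen because it reproduces the B620 quoted above. Encoding the letters of the whole string "Brook Berta" with these rules would give B621 instead.

```python
# Letter groups of the classic Soundex scheme.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def soundex(word: str) -> str:
    """Classic four-character Soundex code of a single word."""
    letters = [ch for ch in word.lower() if ch in ALPHABET]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad with zeros or truncate to 4 chars


def soundex_first_word(name: str) -> str:
    """Assumption: only the first word contributes to the code."""
    parts = name.split()
    return soundex(parts[0]) if parts else ""


def name_key(name: str) -> tuple:
    """Hypothetical stricter key: one Soundex code per word."""
    return tuple(soundex(part) for part in name.split())


if __name__ == "__main__":
    for n in ("brook berta", "Bruck Berta", "Biruk Berta"):
        print(n, soundex_first_word(n), name_key(n))
```

Because a first-word code ignores the surname, something like `name_key` (one code per word) may be a safer grouping key for full names. All three example names still share the same key under it.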

Mario Menger answered Nov 14 '22 21:11