Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does the Soundex function work in SQL Server?

Here's an example of Soundex code in SQL:

SELECT SOUNDEX('Smith'), SOUNDEX('Smythe');
----- ----- 
S530  S530  

How does 'Smith' become S530?

In this example, the first digit is S because that's the first character in the input expression, but how are the remaining three digits are calculated?

like image 619
Naveen Suman Avatar asked Sep 03 '11 20:09

Naveen Suman


2 Answers

Take a look a this article

The first letter of the code corresponds to the first letter of the name. The remainder of the code consists of three digits derived from the syllables of the word according to the following code:

  • 1 = B, F, P, V
  • 2 = C, G, J, K, Q, S, X, Z
  • 3 = D, T
  • 4 = L
  • 5 = M,N
  • 6 = R

The double letters with the same Soundex code, A, E, I, O, U, H, W, Y, and some prefixes are being disregarded...

So for Smith and Smythe the code is created like this:

S  S   ->   S
m  m   ->   5
i  y   ->   0
t  t   ->   3
h  h   ->   0
   e   ->   -
like image 101
Dmitry Avatar answered Oct 27 '22 14:10

Dmitry


What is Soundex?

Soundex is:

a phonetic algorithm for indexing names by sound, as pronounced in English; first developed by Robert C. Russell and Margaret King Odell in 1918

How does it Work?

There are several implementations of Soundex, but most implement the following steps:

  1. Retain the first letter of the name and drop all other occurrences of vowels and h,w:
    |a, e, i, o, u, y, h, w | → "" |
  2. Replace consonants with numbers as follows (after the first letter):
    | b, f, p, v | → 1 |
    | c, g, j, k, q, s, x, z | → 2 |
    | d, t | → 3 |
    | l | → 4 |
    | m, n | → 5 |
    | r | → 6 |
  3. Replace identical adjacent numbers with a single value (if they were next to each other prior to step 1):
    | M33 | → M3 |
  4. Cut or Pad with zeros or cut to produce a 4 digit result:
    | M3 | → M300 |
    | M34123 | → M341 |

Here's an interactive demo in jsFiddle:

jsFiddle Demo

And here's a demo in SQL using SQL Fiddle

In SQL Server, SOUNDEX is often used in conjunction with DIFFERENCE, which is used to score how many of the resulting digits are identical (just like the game mastermind), with higher numbers matching most closely.

Mastermind

What are the Alternatives?

It's important to understand the limitations and criticisms of soundex and where people have tried to improve it, notably only being rooted in English pronunciation and also discards a lot of data, resulting in more false positives.


Both Metaphone & Double Metaphone still focus on English pronunciations, but add much more granularity to the nuances of speech in Enlgish (ie. PHF)

Phil Factor wrote a Metaphone Function in SQL with the source on github


Soundex is most commonly used on identifying similar names, and it'll have a really hard time finding any similar nicknames (i.e. RobertRob or Bob). Per this question on a Database of common name aliases / nicknames of people, you could incorporate a lookup against similar nicknames as well in your matching process.

Here are a couple free lists of common nicknames:

  • SOEMPI - name_to_nick.csv | Github
  • carltonnorthern - names.csv | Github

Further Reading:

  • Fuzzy matching using T-SQL
  • SQL Server – Do You Know Soundex Functions?
like image 4
KyleMit Avatar answered Oct 27 '22 15:10

KyleMit