Levenshtein distance based methods Vs Soundex

Question

As per this comment in a related thread, I'd like to know why Levenshtein distance based methods are better than Soundex.

Keith · Accepted Answer

I agree with you on Daitch-Mokotoff. Soundex is biased because the original US census takers wanted 'Americanized' names.

Maybe an example on the difference would help:

Soundex puts additional weight on the start of a word: it keeps the first letter and encodes only the next three consonant sounds, so it considers at most four phonetic sounds. That means "Schmidt" and "Smith" will match, while "Smith" and "Wmith" won't.
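To make that concrete, here is a minimal sketch of the standard American Soundex rules. The function name `soundex` and the handling of empty or non-alphabetic input are my own choices for illustration, not any particular library's API.

```python
def soundex(word: str) -> str:
    """Sketch of American Soundex: first letter plus three digits."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""  # assumption: empty input yields an empty code

    result = letters[0]
    # The first letter's own code counts when collapsing adjacent duplicates.
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")  # vowels and Y have no code
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept

    # Pad with zeros or truncate to exactly four characters.
    return (result + "000")[:4]


print(soundex("Schmidt"), soundex("Smith"), soundex("Wmith"))
```

"Schmidt" and "Smith" share a code because everything after the leading "S" reduces to the same consonant sounds, while "Wmith" differs in the one position Soundex never encodes away: the first letter.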

Levenshtein's algorithm would be better for finding typos: one or two missing or replaced letters produce a small edit distance (a close match), while the phonetic impact of those letters is ignored.
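For comparison, a minimal sketch of Levenshtein distance using the usual two-row dynamic programming table. The function name `levenshtein` is just an illustrative choice, and the comparison is case-sensitive.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop

    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


print(levenshtein("Smith", "Wmith"))    # one substituted letter
print(levenshtein("Smith", "Schmidt"))  # several edits apart
```

Here the result is the reverse of the phonetic comparison: "Smith" and "Wmith" are a single edit apart, while "Smith" and "Schmidt" are four edits apart.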

I don't think either is better, and I'd consider both a distance algorithm and a phonetic one for helping users correct typed input.

erickson · Answer

I would suggest using Metaphone, not Soundex. As noted, Soundex was developed in the early 20th century for American names. Metaphone gives useful results when checking the work of poor spellers who are "sounding it out" and spelling phonetically.

Edit distance is good at catching typos such as repeated letters, transposed letters, or hitting the wrong key.

Consider the application to decide which will fit your users best, or use both together, with Metaphone complementing the suggestions produced by Levenshtein.

With regard to the original question, I've used n-grams successfully in information retrieval applications.
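As a rough illustration of the n-gram idea, here is a minimal sketch that compares two strings by the overlap of their character trigrams. The padding scheme, the Jaccard measure, and the function names are assumptions made for the example, not a description of any specific retrieval system.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Set of character n-grams, padded with spaces so that the
    start and end of the string produce their own n-grams."""
    pad = " " * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # assumption: two empty inputs count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(ngram_similarity("smith", "smyth"))
print(ngram_similarity("smith", "jones"))
```

Because each string reduces to a set of short keys, the n-grams can be stored in an ordinary inverted index, which is one reason the approach suits retrieval over a large vocabulary.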
