Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The most efficient way to implement a phonetic search

What is the most efficient way to implement a phonetic search in C++ and/or Java? By phonetic search I mean substituting vowels or consonants that sound similar. This would be especially useful for names because sometimes people's names have sort of strange spellings.

I am thinking it might be effective to substitue vowels and some consonants. It may also be good to include some special cases like silent E's at the end or F and PH. Would it be best to use cstrings or strings in C++? Would it be better to store a copy in memory with the substituted values or call a function every time we look for something?

like image 693
ctype.h Avatar asked Dec 02 '11 16:12

ctype.h


People also ask

What is a phonetic search?

Phonetic Search Technology (PST) is a method of speech recognition. An audio signal of speech is broken down into series of phonemes, which can be used to identify words.

How does Double Metaphone work?

Definition: An algorithm to code English words (and foreign words often heard in the United States) phonetically by reducing them to a combination of 12 consonant sounds. It returns two codes if a word has two plausible pronunciations, such as a foreign word.

Which algorithm is the algorithm for phonetic hashing?

Algorithms for such phonetic hashing are commonly collectively known as soundex algorithms. However, there is an original soundex algorithm, with various variants, built on the following scheme: Turn every term to be indexed into a 4-character reduced form.


2 Answers

Besides Soundex you'll find also the Metaphone or Double Metaphone phonetic algorithm, which seem to be an improvement for the English pronunciation and is a quite new algorithm.

For the german pronunciation I use the "Kölner Phonetik".

Apache Commons Codec gives you a very simple Java implementation of those basic algorithms (Soundex, Metaphone, ...) http://commons.apache.org/codec/ For example see the javadoc for the soundex: http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/Soundex.html

Just by typing following code you the the phonetic value of your String:

Soundex soundex = new Soundex();
String phoneticValue = soundex.encode("YourString");

And then you can simply do it for two strings and compare the phonetic values. Hava a look at the following post if you're comparing two strings, because the equals() methods is just black and white, and maybe you'd like to know how many % it is matching:

How to compare almost similar Strings in Java? (String distance measure)

like image 50
FiveO Avatar answered Sep 27 '22 20:09

FiveO


Soundex along with its variants is the standard algorithm for this. It uses phonetic rules to transform the name into an alphanumeric code. Names with the same code are grouped together.

As far as implementing the search, I'd use a data structure that maps each soundex code to the list of names that have that code. Depending on the data structure used (a hash table or a tree), the lookup could be done in time that is either constant on logarithmic in the number of distinct soundex codes.

I am not sure what exactly you mean by cstring (Microsoft's CString?) but the standard std::string class will be perfectly fine for this problem and would be my preferred choice.

like image 26
NPE Avatar answered Sep 27 '22 20:09

NPE