 

Phonetic search for Indian languages


I want to compare strings phonetically in my Android app. The special case here is that I want to compare Indian-language words written in English script. For example, I want to check whether "Edhu", "Adhu" and "Yethu" are phonetically equal; they all mean the same thing in Tamil. People who use English script to write Indian languages spell the same word in different ways. How do I compare words in this case?

I tried Levenshtein distance, but I am not sure how to turn the number it returns into an equal / not-equal decision.

I also tried Soundex. The Soundex codes are not the same when the first letter of the word changes, but it is able to pick out the similar-sounding parts. I don't understand how it works.

 soundex.encode("Yethu")  // Y300
 soundex.encode("Edhu")   // E300
 soundex.encode("adhu")   // A300
55597 asked Jun 15 '15 10:06



1 Answer

As I understand it, you want to take words written in English script, decompose them phonetically, and then group together words that are spelled differently but have the same phonetic representation.

For this, SoundEx is a 90% solution, provided that the people spelling the words in English actually use the correct consonants when they transliterate the words from Tamil.

You should be able to just drop the first character (the letter) from the SoundEx code and use the remaining digits as your encoding when the first letter of the word is a vowel.
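A minimal sketch of that idea, assuming the codes come from whatever encoder you already call (the `soundex.encode(...)` in the question). The class and method names are made up for illustration. Treating a leading y like a vowel is also an assumption; without it "Yethu" would keep its Y and still not match "Edhu".

```java
// Hypothetical helper: turns a Soundex code into a comparison key.
final class PhoneticKey {

    // Drops the leading letter of the code when the word starts with a
    // vowel (or y, an assumption made here so "Yethu" matches "Edhu").
    static String fromSoundex(String word, String soundexCode) {
        if (word.isEmpty() || soundexCode.isEmpty()) {
            return soundexCode;
        }
        char first = Character.toLowerCase(word.charAt(0));
        boolean vowelLike = "aeiouy".indexOf(first) >= 0;
        return vowelLike ? soundexCode.substring(1) : soundexCode;
    }

    public static void main(String[] args) {
        // Codes copied from the question.
        System.out.println(fromSoundex("Yethu", "Y300")); // 300
        System.out.println(fromSoundex("Edhu", "E300"));  // 300
        System.out.println(fromSoundex("adhu", "A300"));  // 300
    }
}
```

Words that start with a consonant keep their full code, so a word encoded as K300 would not be merged with these three.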

The reason is that SoundEx ( https://en.wikipedia.org/wiki/Soundex ) performs its encoding only on the consonants in the words it is given. It throws away all the vowels, plus h, w and y, unless the letter is the first one in the word, which is always kept as-is. That explains why your values are all slightly different, but only in the first character of the encoding.

As for your zeros: a SoundEx encoding is by definition one letter followed by three digits, and consonants map to the digits 1 through 6. You have only one coded consonant in each word (d or t), and SoundEx maps both of them to the number 3. Since there are no more consonants, it pads the code with two zeros to reach the fixed length. Thus you get Letter300.
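To make the steps above concrete, here is a small sketch of the classic American Soundex rules in plain Java. It is not the library you are using, only an illustration; the class name `SimpleSoundex` and the lookup-table layout are my own choices, and it only handles the letters A to Z.

```java
import java.util.Locale;

// Illustrative re-implementation, not the encoder from the question.
final class SimpleSoundex {

    // One entry per letter A..Z. '0' = vowel or y (never emitted),
    // '-' = h or w (skipped entirely), '1'..'6' = consonant groups.
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char last = 0; // code of the previous letter that was looked at
        for (char raw : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not a plain letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (code == '-') {
                continue; // h and w do not separate equal codes
            }
            if (code != '0' && code != last) {
                out.append(code); // new consonant sound
            }
            last = code; // a vowel resets 'last', so a repeat is coded again
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes to letter + three digits
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Yethu")); // Y300
        System.out.println(encode("Edhu"));  // E300
        System.out.println(encode("adhu"));  // A300
    }
}
```

In "Yethu" the only letter that produces a digit is t (3); e, h and u produce nothing, and the two trailing zeros are padding.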

If you are going to continue to use SoundEx for your app, you should bear in mind that it can only give you 26*6*6*6 = 5616 unique encodings with three non-zero digits under its Letter Number(1-6) Number(1-6) Number(1-6) scheme (plus a few more once zero-padded codes such as E300 are counted). This means that the phonetic encodings will not be unique, and some words that are radically different will have SoundEx encodings that collide.
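One possible way to tell colliding words apart is the Levenshtein distance from the question, scaled to a 0..1 similarity by dividing by the length of the longer word. This is only a sketch: the class name `EditSimilarity` is invented for illustration, and any cutoff you pick for "similar enough" is an assumption you would have to tune on real data.

```java
import java.util.Locale;

// Hypothetical helper: normalized Levenshtein similarity.
final class EditSimilarity {

    // Classic dynamic-programming edit distance, keeping two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // 1.0 means identical (ignoring case), 0.0 means nothing in common.
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(x, y) / longest;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Edhu", "Adhu"));
        System.out.println(similarity("Edhu", "Yethu"));
    }
}
```

"Edhu" and "Adhu" differ by one substitution out of four letters, so they score 0.75. You could accept a pair only when the SoundEx digits match and the similarity is above your chosen cutoff.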

Semicolons and Duct Tape answered Sep 21 '22 08:09
