
Estimate Phonemic Similarity Between Two Words

I am working on detecting rhymes in Python using the Carnegie Mellon University Pronouncing Dictionary, and would like to know: How can I estimate the phonemic similarity between two words? In other words, is there an algorithm that can identify the fact that "hands" and "plans" are closer to rhyming than are "hands" and "fries"?

Some context: At first, I was willing to say that two words rhyme if their primary stressed syllable and all subsequent syllables are identical (the dictionary file is c06d, if you want to replicate this in Python):

def create_cmu_sound_dict():
    """Map each word to its phonemes from the primary stress onward."""
    final_sound_dict = {}

    with open('resources/c06d/c06d') as cmu_dict:
        for line in cmu_dict.read().split("\n"):
            fields = line.split()
            if len(fields) > 1:
                word = fields[0]
                phonemes = fields[1:]

                final_sound = ""
                final_sound_switch = 0

                # "1" marks the primary-stressed vowel; collect it and
                # every phoneme after it.
                for phoneme in phonemes:
                    if "1" in phoneme:
                        final_sound_switch = 1
                        final_sound += phoneme
                    elif final_sound_switch == 1:
                        final_sound += phoneme

                final_sound_dict[word.lower()] = final_sound

    return final_sound_dict

If I then run

cmu_final_sound_dict = create_cmu_sound_dict()
print(cmu_final_sound_dict["hands"])  # AE1NDZ
print(cmu_final_sound_dict["plans"])  # AE1NZ

I can see that "hands" and "plans" sound very similar. I could work towards an estimate of this similarity on my own, but I thought I should ask: Are there sophisticated algorithms that can tie a mathematical value to this degree of sonic (or auditory) similarity? That is, what algorithms or packages can one use to mathematize the degree of phonemic similarity between two words? I realize this is a large question, but I would be most grateful for any advice others can offer.
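For what it's worth, under the strict definition above, a rhyme test reduces to an equality check on the final-sound strings. Here is a minimal sketch built on the dictionary above; it also shows that "hands" and "plans" narrowly fail the strict test:

cmu_final_sound_dict = create_cmu_sound_dict()

def is_strict_rhyme(word_a, word_b):
    """True when both words match from the primary stress onward."""
    sound_a = cmu_final_sound_dict.get(word_a.lower())
    sound_b = cmu_final_sound_dict.get(word_b.lower())
    return sound_a is not None and sound_a == sound_b

print(is_strict_rhyme("hands", "plans"))  # False: AE1NDZ vs. AE1NZ
print(is_strict_rhyme("hands", "bands"))  # True, assuming both words are in the dict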

asked Oct 20 '14 by duhaime
1 Answer

Cheat.

#!/usr/bin/env python

from Levenshtein import distance

if __name__ == '__main__':
    # CMU pronunciations: "hands" vs. "plans", "hands" vs. "fries"
    s1 = ['HH AE1 N D Z', 'P L AE1 N Z']
    s2 = ['HH AE1 N D Z', 'F R AY1 Z']
    s1nospaces = [x.replace(' ', '') for x in s1]
    s2nospaces = [x.replace(' ', '') for x in s2]
    for seq in [s1, s2, s1nospaces, s2nospaces]:
        print(seq, distance(*seq))

Output:

['HH AE1 N D Z', 'P L AE1 N Z'] 5
['HH AE1 N D Z', 'F R AY1 Z'] 8
['HHAE1NDZ', 'PLAE1NZ'] 3
['HHAE1NDZ', 'FRAY1Z'] 5

Library: https://pypi.python.org/pypi/python-Levenshtein/0.11.2
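Note that the character-level distance above penalizes multi-letter phoneme symbols ('HH' alone contributes two character edits). A quick, dependency-free alternative is to compare phoneme token sequences with difflib from the standard library (a sketch; SequenceMatcher's ratio() is a similarity in [0, 1], not an edit distance):

from difflib import SequenceMatcher

def phoneme_similarity(pron_a, pron_b):
    """Similarity ratio computed over phoneme tokens, not characters."""
    return SequenceMatcher(None, pron_a.split(), pron_b.split()).ratio()

print(phoneme_similarity('HH AE1 N D Z', 'P L AE1 N Z'))  # 0.6
print(phoneme_similarity('HH AE1 N D Z', 'F R AY1 Z'))    # ≈ 0.22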

Seriously, however: since you only have text as input, and the CMU dict itself is text-based, you're limited to some sort of manipulation of the text input. But the way I see it, there's only a limited number of phonemes available, so you could take the most important ones and assign "phonemic weights" to them. There are only 74 of them in the CMU dictionary you pointed to:

 % cat cmudict.06.txt | grep -v '#' | cut -f 2- -d ' ' | tr ' ' '\n' | sort | uniq | wc -l
 75

(75 minus one for the empty line)
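If you'd rather stay in Python, here's a sketch that collects the same inventory straight from the dictionary file (mirroring the grep above by skipping any line containing '#'):

phoneme_inventory = set()
with open('cmudict.06.txt') as cmu:
    for line in cmu:
        if '#' in line:          # skip comment lines, as grep -v '#' does
            continue
        fields = line.split()
        if len(fields) > 1:
            phoneme_inventory.update(fields[1:])

print(len(phoneme_inventory))    # 74, matching the shell count above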

You'd probably get better results if you did something more advanced in step 2: assign weights to particular phoneme combinations. Then you could modify a Levenshtein-type distance metric, e.g. the one in the library above, to arrive at a reasonably performing "phonemic distance" metric that works on text inputs.
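As a concrete illustration of step 2, here's a minimal sketch of such a weighted, phoneme-level distance. The cost values are made-up assumptions (stress differences cheapest, vowel-for-vowel or consonant-for-consonant swaps cheaper than vowel-for-consonant ones), not anything derived from the CMU data:

CMU_VOWELS = {'AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'EH', 'ER',
              'EY', 'IH', 'IY', 'OW', 'OY', 'UH', 'UW'}

def strip_stress(phoneme):
    # 'AE1' -> 'AE'; consonants pass through unchanged
    return phoneme.rstrip('012')

def sub_cost(p1, p2):
    """Illustrative substitution cost between two phonemes (0.0 to 1.0)."""
    if p1 == p2:
        return 0.0
    if strip_stress(p1) == strip_stress(p2):
        return 0.2   # same phoneme, different stress
    if (strip_stress(p1) in CMU_VOWELS) == (strip_stress(p2) in CMU_VOWELS):
        return 0.6   # vowel for vowel, or consonant for consonant
    return 1.0       # vowel for consonant: treat as maximally different

def phonemic_distance(seq1, seq2, indel=1.0):
    """Textbook dynamic-programming edit distance over phoneme lists."""
    m, n = len(seq1), len(seq2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel
    for j in range(1, n + 1):
        d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + indel,      # delete from seq1
                          d[i][j - 1] + indel,      # insert into seq1
                          d[i - 1][j - 1] + sub_cost(seq1[i - 1], seq2[j - 1]))
    return d[m][n]

print(phonemic_distance('HH AE1 N D Z'.split(), 'P L AE1 N Z'.split()))  # ≈ 2.6
print(phonemic_distance('HH AE1 N D Z'.split(), 'F R AY1 Z'.split()))    # ≈ 3.2

With these weights "plans" again comes out closer to "hands" than "fries" does, but now the costs are knobs you can tune against known rhyme pairs.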

Not much work for step 3: profit.

answered Sep 25 '22 by LetMeSOThat4U