
Generate a list of English words containing consecutive consonant sounds

Tags: algorithm, nlp

Start with this:

[G|C] * [T] *

Write a program that generates this:

Cat
Cut
Cute
City <-- NOTE: this one is wrong, because City has an "ESS" sound at the start.
Caught
...
Gate
Gotti
Gut
...
Kit
Kite
Kate
Kata
Katie

Another example. This:

[C] * [T] * [N]

Should produce this:

Cotton
Kitten

Where should I start my research as I figure out how to write a program/script that does this?

dreftymac asked Dec 10 '22 17:12


1 Answer

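You can do this by running regular expressions against a pronunciation dictionary, that is, a word list in which each entry is followed by its phonetic transcription. Matching on the phonemes rather than the spelling is what excludes a word like City, whose C is pronounced as an S.

Here's an example in JavaScript: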

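    <html>
    <head>
        <title>Test</title>
        <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script>
        <script>

            // Load the pronunciation dictionary (one "WORD PHONEME PHONEME ..." entry per line).
            $.get('cmudict0.3', function (data) {
                // Keep entries whose phonemes start with K, contain a T, and end with N.
                // The \b after T stops it from also matching the TH phoneme.
                // match() returns null when nothing matches, so fall back to an empty list.
                var matches = data.match(/^(\S*)\s+K.*\sT\b.*\sN$/mg) || [];
                $('body').html('<p>' + matches.join('<br/> ') + '</p>');
            });

        </script>
    </head>
    <body>
    </body>
    </html>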

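You'll need to download the list of all words from http://icon.shef.ac.uk/Moby/mpron.tar.Z, uncompress it, and put it in the same folder as the HTML file under the name the script requests (cmudict0.3 in the code above). The script expects each entry to be a word followed by its phonemes, as in the sample below. I've only translated the [C] * [T] * [N] pattern into a regular expression, and the output isn't very pretty, but it should give you the idea. Here's a sample of the output: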

CALTON K AE1 L T AH0 N
CAMPTON K AE1 M P T AH0 N
CANTEEN K AE0 N T IY1 N
CANTIN K AA0 N T IY1 N
CANTLIN K AE1 N T L IH0 N
CANTLON K AE1 N T L AH0 N
...
COTTERMAN K AA1 T ER0 M AH0 N
COTTMAN K AA1 T M AH0 N
COTTON K AA1 T AH0 N
COTTON(2) K AO1 T AH0 N
COULSTON K AW1 L S T AH0 N
COUNTDOWN K AW1 N T D AW2 N
...
KITSON K IH1 T S AH0 N
KITTELSON K IH1 T IH0 L S AH0 N
KITTEN K IH1 T AH0 N
KITTERMAN K IH1 T ER0 M AH0 N
KITTLESON K IH1 T L IH0 S AH0 N
...
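
The other patterns from the question could be translated mechanically instead of by hand. What follows is only a rough sketch under some assumptions: the names SOUND, patternToRegex and findWords are made up for illustration, the letter-to-phoneme table is my guess at what the question means (C is treated as the hard K sound), and the dictionary is assumed to have one entry per line in the format shown in the sample output above.

```javascript
// Hypothetical mapping from the question's letters to dictionary phonemes.
// Assumption: "C" in a pattern means the hard K sound.
var SOUND = { C: 'K', K: 'K', G: 'G', T: 'T', N: 'N' };

// Turn a pattern such as "[G|C] * [T] *" into a regular expression that
// matches one dictionary line of the form "WORD PHONEME PHONEME ...".
function patternToRegex(pattern) {
    var body = pattern.trim().split(/\s+/).map(function (token) {
        if (token === '*') {
            return '(?:[ \\t]+\\S+)*';        // zero or more phonemes of any kind
        }
        // "[G|C]" -> ["G", "C"] -> ["G", "K"]
        var sounds = token.slice(1, -1).split('|').map(function (letter) {
            if (!SOUND.hasOwnProperty(letter)) {
                throw new Error('No phoneme known for "' + letter + '"');
            }
            return SOUND[letter];
        });
        // Each phoneme consumes the whitespace in front of it, so "T" can
        // only match a whole token and never the start of "TH".
        return '[ \\t]+(?:' + sounds.join('|') + ')';
    }).join('');
    return new RegExp('^(\\S+)' + body + '$', 'gm');
}

// Return the matching words only, without phonemes or duplicate entries.
function findWords(pattern, dictionaryText) {
    var regex = patternToRegex(pattern);
    var words = [];
    var match;
    while ((match = regex.exec(dictionaryText)) !== null) {
        var word = match[1].replace(/\(\d+\)$/, '');   // COTTON(2) -> COTTON
        if (words.indexOf(word) === -1) {
            words.push(word);
        }
    }
    return words;
}

// A few entries in the same format as the sample output above.
var sample = [
    'CAT  K AE1 T',
    'CITY  S IH1 T IY0',
    'COTTON  K AA1 T AH0 N',
    'COTTON(2)  K AO1 T AH0 N',
    'GATE  G EY1 T',
    'KEITH  K IY1 TH',
    'KITTEN  K IH1 T AH0 N'
].join('\n');

console.log(findWords('[G|C] * [T] *', sample));   // [ 'CAT', 'COTTON', 'GATE', 'KITTEN' ]
console.log(findWords('[C] * [T] * [N]', sample)); // [ 'COTTON', 'KITTEN' ]
```

CITY is skipped because its first phoneme is S, and KEITH is skipped because TH is a different phoneme from T. In the page above you would pass the data argument of the $.get callback to findWords in place of sample.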
Rich answered Dec 13 '22 11:12