
Rhyme Dictionary from CMU pronunciation database

I'm looking for a free or open source rhyming database.

I've found the CMU pronunciation "database" and its series of apps but I can't make sense of them or figure out where the data's coming from.

A simple text file with the word and its phonemes is all I need.

Does anybody here know where I'd find one or where I would begin to derive such a list from the CMU files?

Kevin Avatar asked Apr 04 '13 22:04



2 Answers

cmudict

The cmudict is a text file and its format is really simple. First, the word is listed. Then, there are two spaces. Everything following the two spaces is the pronunciation. Where a word has more than one way of being spoken you will see one entry per pronunciation, like

word
word(1)

At the beginning of the file they've listed symbols and punctuation. The symbol is followed by the English spelling of said symbol's name with no space between them. This is then followed by the two-space divider and the ARPAbet code. Since you're only looking for rhymes you don't have to do anything special with the symbols section, because you're never going to be looking for a rhyme to ...ELLIPSIS

ARPAbet

The information about how ARPAbet codes map to IPA is listed on Wikipedia (http://en.wikipedia.org/wiki/Arpabet), and each mapping shows example words. It's pretty easy to see how the two relate to one another, and that may help you understand how to read the ARPAbet codes if you are familiar with IPA.

Summary

Basically, if you've already found the cmudict then you've already got what you asked for: a database of words and their pronunciations. To find words that rhyme you'll have to parse the flat file into a table and run a query to find words that end with the same ARPAbet code.

General Theory of Doing Stuff to Things

Part: Stuff

  1. create a new database
  2. create a table in the database with three fields: index, word, arpabet
  3. read the cmudict file line by line
  4. for each line, split it into two parts at the first pair of consecutive spaces AND
  5. increment the index count, then insert the index number, word, and arpabet code
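If you'd rather not install anything, the steps in "Part: Stuff" can be sketched in a few lines of Python with the standard sqlite3 module. This is only a rough sketch under some assumptions: the table name `words`, the column name `id` (used instead of `index`, which is a reserved word in SQL), and the helper names `parse_line` and `load` are all made up for illustration. It also assumes comment lines start with `;;;`, as they do in cmudict-0.7b, and it strips the `(1)` suffix so alternate pronunciations share the same word.

```python
import re
import sqlite3

# Hypothetical sample in cmudict's layout: word, two spaces, ARPAbet codes.
SAMPLE = """\
;;; comment line
LOVE  L AH1 V
GLOVE  G L AH1 V
DENEUVE  D IH0 N AH1 V
DENEUVE(1)  D IY0 N AH1 V
"""

def parse_line(line):
    """Return (word, arpabet), or None for blanks, comments, and odd lines."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    # Split at the first pair of consecutive spaces.
    word, sep, arpabet = line.partition("  ")
    if not sep:
        return None
    # Drop the "(1)", "(2)" suffix that marks alternate pronunciations.
    word = re.sub(r"\(\d+\)$", "", word)
    return word, arpabet.strip()

def load(lines, conn):
    """Create the table and insert one row per pronunciation."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "id INTEGER PRIMARY KEY, word TEXT, arpabet TEXT)"
    )
    rows = [p for p in map(parse_line, lines) if p is not None]
    # SQLite fills in the INTEGER PRIMARY KEY, so no manual counter is needed.
    conn.executemany("INSERT INTO words (word, arpabet) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
count = load(SAMPLE.splitlines(), conn)
```

For the real file you would pass an open file object to `load` instead of the sample; depending on the cmudict version you may need to specify an encoding when opening it.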

Then Umm...

Once you've got the data into whatever kind of database you choose, you can then use that database to find correlations between the arpabet codes. You could find rhymes, consonance, assonance, and other mnemonic devices. It would go something like

Part: Thing

  1. get a word you want to find a rhyme for
  2. query the database for the arpabet equivalent of the word
  3. split the arpabet code into pieces by breaking it up everywhere there is a space
  4. take the last piece of the code and query the database for words whose ARPAbet codes end with that piece
  5. Do fancy things with the rhymes
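Here is a rough Python sketch of "Part: Thing", assuming a `words` table with `word` and `arpabet` columns (hypothetical names, as is the function `words_sharing_last_piece`). Note that matching only the last piece is a loose match: it finds words that share a final sound, not necessarily words that rhyme well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT, arpabet TEXT)")
conn.executemany(
    "INSERT INTO words (word, arpabet) VALUES (?, ?)",
    [
        ("LOVE", "L AH1 V"),
        ("GLOVE", "G L AH1 V"),
        ("GIVE", "G IH1 V"),   # shares only the final V with LOVE
        ("CAT", "K AE1 T"),
    ],
)

def words_sharing_last_piece(conn, word):
    # Step 2: look up the ARPAbet code for the word.
    row = conn.execute(
        "SELECT arpabet FROM words WHERE word = ?", (word,)
    ).fetchone()
    if row is None:
        return []
    # Step 3: split on spaces. Step 4: keep the last piece.
    last = row[0].split(" ")[-1]
    # The leading space in the pattern makes the match line up with a
    # whole code, but it also skips words whose entire code is one piece.
    found = conn.execute(
        "SELECT word FROM words WHERE arpabet LIKE ? AND word <> ? ORDER BY word",
        ("% " + last, word),
    ).fetchall()
    return [w for (w,) in found]

matches = words_sharing_last_piece(conn, "LOVE")
```

With this sample data, GIVE comes back alongside GLOVE, which shows why you would usually want to compare more than the final piece.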

Shortcuts and Spoilers

I got bored and wrote a Node.js module that covers "Part: Stuff" listed above. If you've got Node.js installed on your machine you can get the module by running npm install cmudict-to-sqlite. See https://npmjs.org/package/cmudict-to-sqlite for the README or just look in the module for docs.

Kastor Avatar answered Oct 18 '22 04:10



Rhyme Logic using CMU Pronouncing Dictionary

OK. Suppose you want to use CMU Pronouncing Dictionary data (example file: cmudict-0.7b) to build a list of all the words that rhyme with "LOVE".

Here's how you might do it:

First, you need to learn the pronunciation of "LOVE". You'll find this line in the dictionary, where "LOVE" and "L AH1 V" are separated by two spaces:

LOVE  L AH1 V

This is saying that the word LOVE is pronounced like L AH1 V.

Then, find the vowel phoneme that has primary stress. In other words, look for the number "1" in that pronunciation. The text directly to the left of the 1 is the vowel sound that has primary stress (AH). That text, and everything to the right of it, are your "rhyme phonemes" (for lack of a better term). So the rhyme phonemes for LOVE are AH1 V.

We're half done! Now we just have to find other words whose pronunciations end with AH1 V. If you're playing along in Notepad++, try a Find All In Current Document for pattern AH1 V$ using Search Mode of "Regular expression". This will match lines like:

Line 392: ABOVE  AH0 B AH1 V
Line 10266: BELOVE  B IH0 L AH1 V
Line 30204: DENEUVE  D IH0 N AH1 V
Line 30205: DENEUVE(1)  D IY0 N AH1 V
Line 34064: DOVE  D AH1 V
Line 48177: GLOVE  G L AH1 V
Line 49053: GOV  G AH1 V
... etc

Rhyming woooooords!

There are plenty of ways to implement this, and plenty of corner cases, but this is roughly the approach that many electronic rhyming dictionaries appear to take when finding perfect rhymes.
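To make the "love" walkthrough concrete, here is a minimal Python sketch. The function names `rhyme_phonemes` and `rhymes_for` are made up for illustration, the entries are a handful copied from the listing above, and the code simply takes the first "1" it finds, which is only one of several possible choices.

```python
def rhyme_phonemes(pronunciation):
    """Return the phonemes from the first primary-stressed vowel to the end."""
    phonemes = pronunciation.split()
    for i, phoneme in enumerate(phonemes):
        if phoneme.endswith("1"):
            return " ".join(phonemes[i:])
    return None  # no primary stress in this pronunciation

# A few (word, pronunciation) pairs taken from the examples in this answer.
ENTRIES = [
    ("LOVE", "L AH1 V"),
    ("ABOVE", "AH0 B AH1 V"),
    ("DOVE", "D AH1 V"),
    ("GLOVE", "G L AH1 V"),
    ("AIKIN", "EY0 K IH0 N"),
]

def rhymes_for(word, entries):
    """Scan every entry and keep words with the same rhyme phonemes."""
    targets = {rhyme_phonemes(p) for w, p in entries if w == word}
    targets.discard(None)
    return sorted(
        w for w, p in entries
        if w != word and rhyme_phonemes(p) in targets
    )

love_rhymes = rhymes_for("LOVE", ENTRIES)
```

This does the same full scan as the Notepad++ search, so it is fine for experimenting but slow if you call it for every lookup.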

Hypothetical SQL approach to storing rhyme data

Obviously, performance will be a problem if you just scan the dictionary every time someone wants a rhyme. If that's a concern, you might try storing or indexing the data differently.

Although it's not the most efficient on disk space, I've had a good experience storing this stuff in a SQL table with indexed columns.

For a simple conceptual example, you could compute the "rhyme phonemes" of all words in the dictionary, then insert them into a "Rhymes" table whose columns are { WordText, RhymePhonemes }. For example, you might see records like:

{"ABOVE", "AH1 V"}
{"DOVE", "AH1 V"}
{"OUTLIVE", "IH1 V"}
{"GRADUATE", "AE1 JH AH0 W AH0 T"}
{"GRADUATE", "AE1 JH AH0 W EY2 T"}

... etc

Then, to find rhymes, you'd issue a query like:

SELECT OTHER.WordText
FROM Rhymes INPUT
     INNER JOIN Rhymes OTHER ON OTHER.RhymePhonemes = INPUT.RhymePhonemes
WHERE INPUT.WordText = 'LOVE' AND
      OTHER.WordText <> INPUT.WordText
ORDER BY OTHER.WordText

This also comes in handy if you're planning on printing a dictionary where all similar-sounding words are grouped together.

There are of course plenty of other ways to store/search the data of varying trade-offs, but hopefully this gets you started.
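If you want to try the table idea without setting up a server, something like the following works with Python's built-in sqlite3 module. Treat it as a sketch: the index names and the `find_rhymes` helper are my own inventions, the aliases are renamed to `inp` and `other`, and I added DISTINCT so a word with several matching pronunciations shows up once. The lookup is upper-cased because the dictionary stores words in upper case and SQLite compares text case-sensitively by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Rhymes (WordText TEXT, RhymePhonemes TEXT)")
# Index both columns so the lookup and the self-join avoid full scans.
conn.execute("CREATE INDEX idx_rhymes_word ON Rhymes (WordText)")
conn.execute("CREATE INDEX idx_rhymes_phonemes ON Rhymes (RhymePhonemes)")
conn.executemany(
    "INSERT INTO Rhymes (WordText, RhymePhonemes) VALUES (?, ?)",
    [
        ("LOVE", "AH1 V"),
        ("ABOVE", "AH1 V"),
        ("DOVE", "AH1 V"),
        ("OUTLIVE", "IH1 V"),
        ("GRADUATE", "AE1 JH AH0 W AH0 T"),
        ("GRADUATE", "AE1 JH AH0 W EY2 T"),
    ],
)

QUERY = """
SELECT DISTINCT other.WordText
FROM Rhymes AS inp
     INNER JOIN Rhymes AS other ON other.RhymePhonemes = inp.RhymePhonemes
WHERE inp.WordText = ?
  AND other.WordText <> inp.WordText
ORDER BY other.WordText
"""

def find_rhymes(conn, word):
    """Return words whose rhyme phonemes match any pronunciation of `word`."""
    return [w for (w,) in conn.execute(QUERY, (word.upper(),))]

love_rhymes = find_rhymes(conn, "love")
```

Passing the word as a bound parameter (the `?`) rather than pasting it into the SQL string also keeps user input from breaking the query.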

I've also had some luck storing the raw pronunciation in the database in varying "full" formats (forward and reversed strings of the pronunciation, with stress marks and without stress marks, etc) but not "chopped" into specific pieces like a rhyme-phoneme column.

Gotchas

Again, the original explanation with "love" will absolutely get you in the ballpark of rhyming. However, along the way you'll probably run into other gotchas to consider. Here's a heads-up:

  1. Some words have multiple pronunciations. In the CMU dictionary, the alternate pronunciations are marked with text like (1), (2), etc following the word as in GRADUATE(2). If someone wants a rhyme of these words, you have to decide between showing rhymes of ALL matched pronunciations, or having the user choose which pronunciation they really meant.
  2. What do you do when the pronunciation has two or more "1"s? Pick the first one? Pick the last one? If you pick the last one, you'll find more rhymes, but it might not be the most natural choice of stress.
  3. What do you do when the pronunciation has no "1"s? It doesn't happen a lot, but it happens, like: ACCREDIT AH0 K R EH2 D AH0 T and AIKIN EY0 K IH0 N. In this case I'd pick the next best stress (i.e. pick the 2 if the 1 is absent, since the dictionary only uses the stress marks 0, 1, and 2). If they're all 0's, I don't have any good advice.
  4. Some pronunciations are missing. The dictionary is a great start, but it doesn't have all the words or spellings of words you might want. US spelling is preferred over UK spelling.
  5. Some pronunciations are not what you'd expect, and you may want to prune. For example there's a pronunciation of "or" that sounds like "er".
  6. You may want to compare the "rhyme phonemes" with stress marks removed. This only matters for words whose primary stress is not on the last vowel (so you don't see the problem on the "love" example).
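A few of these gotchas (items 2, 3, and 6) can be folded into the rhyme-phoneme extraction itself. The sketch below is one possible way to do it, not a recommendation; the parameter names `use_last` and `keep_stress` are made up, and the choices they expose are exactly the open questions listed above.

```python
import re

def rhyme_phonemes(pronunciation, use_last=False, keep_stress=True):
    """Extract rhyme phonemes, falling back to secondary stress if needed."""
    phonemes = pronunciation.split()
    # Try primary stress ("1") first, then secondary stress ("2").
    for mark in ("1", "2"):
        hits = [i for i, p in enumerate(phonemes) if p.endswith(mark)]
        if hits:
            # Gotcha 2: choose the first or the last stressed vowel.
            start = hits[-1] if use_last else hits[0]
            tail = phonemes[start:]
            break
    else:
        return None  # every vowel is unstressed; no good answer here
    if not keep_stress:
        # Gotcha 6: drop the trailing stress digit from each vowel.
        tail = [re.sub(r"\d$", "", p) for p in tail]
    return " ".join(tail)

# Pronunciations below are copied from this answer, except "AA1 B IY1",
# which is a made-up string used only to exercise the use_last option.
fallback = rhyme_phonemes("AH0 K R EH2 D AH0 T")
unstressed = rhyme_phonemes("EY0 K IH0 N")
stripped = rhyme_phonemes("AE1 JH AH0 W EY2 T", keep_stress=False)
last_one = rhyme_phonemes("AA1 B IY1", use_last=True)
```

Whatever options you settle on, use the same ones when you build the table and when you look a word up, otherwise the stored phonemes and the query phonemes won't line up.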
Plate Avatar answered Oct 18 '22 03:10
