I see two different approaches to accessing information from the Carnegie Mellon Pronouncing Dictionary Corpus Reader (cmudict) in NLTK (via Python3) and am having a hard time understanding the difference between them:
from nltk.corpus import cmudict
pro1 = cmudict.entries()
from nltk.corpus import cmudict
pro2 = cmudict.dict()
According to the docs (here) cmudict.entries() returns "the cmudict lexicon as a list of entries containing (word, transcriptions) tuples" whereas cmudict.dict() returns "the cmudict lexicon as a dictionary, whose keys are lowercase words and whose values are lists of pronunciations".
However, if the difference between cmudict.entries() and cmudict.dict() is only a difference in returned data type (seems to be what the docs are indicating) why does calling len() on the data from each result in two different numbers (example below)?
from nltk.corpus import cmudict
pro1 = cmudict.entries()
pro2 = cmudict.dict()
output = ' '.join(["entries length is", str(len(pro1)), "dict length is", str(len(pro2))])
print(output)
which returns: entries length is 133737 dict length is 123455
Is there something I am misunderstanding about the difference between these two methods? Is cmudict.entries() somehow more complete?
The cmudict module returns all of CMUdict as one big dictionary via the dict() method, with each word as a key and a list of that word's phonetic representations as the value. The entries() method returns a large list of tuples, where each tuple consists of two elements: a word string and a list of phonemes for one pronunciation of that word. Each tuple represents only a single phonetic representation of the word, so words with multiple phonetic representations have multiple tuple entries. As a result, the length of entries() is greater than the length of dict(), since dict() has only one entry (key/value pair) per word, whereas entries() may have multiple entries (tuples) per word. Neither is more complete: both hold the same pronunciations, just organized differently.
Some example code may help:
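# Standalone cmudict package; nltk.corpus.cmudict offers the same dict() and entries() methods.
import cmudict

def lookup_word(word_s):
    # One key per word; the value is a list of all its pronunciations.
    return cmudict.dict().get(word_s)  # standard dict access

def lookup2_word(word_s):
    # One tuple per pronunciation, so collect every tuple whose word matches.
    entries = [e[1] for e in cmudict.entries() if e[0] == word_s]
    return entries

def count_syllables(word_s):
    count = 0
    phones = lookup_word(word_s)
    if phones:
        # Use the first pronunciation; vowel phonemes end in a stress digit.
        phones0 = phones[0]
        count = len([p for p in phones0 if p[-1].isdigit()])
    return count

word_s = 'hello'
phones = lookup_word(word_s)
phones2 = lookup2_word(word_s)
count = count_syllables(word_s)
print(f"PHONES({word_s!r}) yields {phones}\nCOUNT is {count}")
print(f"PHONES are same: {phones == phones2}")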
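The length difference can also be reproduced without loading the corpus. The following is only a minimal sketch: sample_entries is a small hand-written list shaped like the output of entries() (it is not read from CMUdict), and group_entries is a hypothetical helper, not part of NLTK or the cmudict package. It groups the tuples by word the way dict() presents them.

```python
from collections import defaultdict

# Hand-written sample shaped like cmudict.entries(): one (word, phonemes)
# tuple per pronunciation. This is illustrative data, not the real corpus.
sample_entries = [
    ("cat", ["K", "AE1", "T"]),
    ("hello", ["HH", "AH0", "L", "OW1"]),
    ("hello", ["HH", "EH0", "L", "OW1"]),
    ("tomato", ["T", "AH0", "M", "EY1", "T", "OW2"]),
    ("tomato", ["T", "AH0", "M", "AA1", "T", "OW2"]),
]

def group_entries(entries):
    """Hypothetical helper: collect every pronunciation under its word."""
    grouped = defaultdict(list)
    for word, phonemes in entries:
        grouped[word].append(phonemes)
    return dict(grouped)

sample_dict = group_entries(sample_entries)

# One tuple per pronunciation versus one key per word.
print("entries length is", len(sample_entries))
print("dict length is", len(sample_dict))

# Summing the pronunciations stored under each key recovers the entries count.
print("total pronunciations:", sum(len(p) for p in sample_dict.values()))
```

In this sample the list has five tuples but the dictionary has three keys, because "hello" and "tomato" each carry two pronunciations. Applied to the real data, the same reasoning would account for the gap between 133737 and 123455.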