cmudict.dict() vs cmudict.entries() (Python3, NLTK)

Question

I see two different approaches to accessing information from the Carnegie Mellon Pronouncing Dictionary Corpus Reader (cmudict) in NLTK (via Python3) and am having a hard time understanding the difference between them:

Version 1

from nltk.corpus import cmudict
pro1 = cmudict.entries()

Version 2

from nltk.corpus import cmudict
pro2 = cmudict.dict()

According to the docs (here) cmudict.entries() returns "the cmudict lexicon as a list of entries containing (word, transcriptions) tuples" whereas cmudict.dict() returns "the cmudict lexicon as a dictionary, whose keys are lowercase words and whose values are lists of pronunciations".

However, if the difference between cmudict.entries() and cmudict.dict() is only a difference in returned data type (seems to be what the docs are indicating) why does calling len() on the data from each result in two different numbers (example below)?

from nltk.corpus import cmudict

pro1 = cmudict.entries()
pro2 = cmudict.dict()

output = ' '.join(["entries length is", str(len(pro1)), "dict length is", str(len(pro2))])
print(output)

which returns: entries length is 133737 dict length is 123455

Is there something I am misunderstanding about the difference between these two methods? Is cmudict.entries() somehow more complete?

Gary02127 · Accepted Answer

The dict() method returns all of CMUdict as one big dictionary, with each word as a key and a list of that word's pronunciations as the value. The entries() method returns a large list of tuples, where each tuple consists of two elements: a word string and a single pronunciation, which is itself a list of phoneme strings. Because each tuple represents only one pronunciation, a word with multiple pronunciations has multiple tuples. As a result, len(cmudict.entries()) is greater than len(cmudict.dict()): dict() has exactly one key/value pair per word, whereas entries() may have several tuples per word. Neither is more complete; the data is the same, only grouped differently.
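To make the length difference concrete, here is a minimal sketch that groups an entries-style list into a dict-style mapping. The sample_entries list and the group_entries helper are hypothetical stand-ins made up for illustration (so the snippet runs without downloading the corpus); they are not part of NLTK.

```python
from collections import defaultdict

# Hypothetical sample in the shape cmudict.entries() returns:
# one (word, pronunciation) tuple per pronunciation.
sample_entries = [
    ("hello", ["HH", "AH0", "L", "OW1"]),
    ("hello", ["HH", "EH0", "L", "OW1"]),
    ("world", ["W", "ER1", "L", "D"]),
]

def group_entries(entries):
    """Collect every pronunciation of a word under a single key."""
    grouped = defaultdict(list)
    for word, pron in entries:
        grouped[word].append(pron)
    return dict(grouped)

sample_dict = group_entries(sample_entries)

# Three tuples, but only two distinct words.
print(len(sample_entries), len(sample_dict))
```

The total number of pronunciations stored in the grouped mapping still matches the length of the list; only the number of top-level items differs.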

Some example code may help:

from nltk.corpus import cmudict

def lookup_word(word_s):
    return cmudict.dict().get(word_s)        # standard dict access

def lookup2_word(word_s):
    entries = [e[1] for e in cmudict.entries() if e[0] == word_s]
    return entries

def count_syllables(word_s):
    count = 0
    phones = lookup_word(word_s)
    if phones:
        phones0 = phones[0]
        count = len([p for p in phones0 if p[-1].isdigit()])
    return count

word_s = 'hello'
phones = lookup_word(word_s)
phones2 = lookup2_word(word_s)
count = count_syllables(word_s)
print(f"PHONES({word_s!r}) yields {phones}\nCOUNT is {count}")
print(f"PHONES are same: {phones == phones2}")
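Note that count_syllables above only looks at the first pronunciation, phones[0]. In CMUdict's notation, vowel phones end in a stress digit (0, 1, or 2), which is why counting phones that end in a digit gives a syllable count. As a rough sketch, the same count can be taken for every pronunciation of a word; syllables_per_pronunciation and sample_prons are hypothetical names and data used only for illustration.

```python
def syllables_per_pronunciation(pronunciations):
    """Count stress-marked (vowel) phones in each pronunciation."""
    return [
        sum(1 for phone in pron if phone[-1].isdigit())
        for pron in pronunciations
    ]

# Hypothetical sample shaped like one value of cmudict.dict():
# a list of pronunciations, each a list of phone strings.
sample_prons = [
    ["HH", "AH0", "L", "OW1"],
    ["HH", "EH0", "L", "OW1"],
]

counts = syllables_per_pronunciation(sample_prons)
print(counts)
```

With real data, the argument would be something like cmudict.dict().get(word_s, []), so that an unknown word yields an empty list rather than an error.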
