
Words generated from Text.similar() and ContextIndex.similar_words() in NLTK sorted by frequency?

Tags: python, nltk

I'm using these two functions to find similar words, and they return different lists. Are the results of each function sorted from most to least frequent association?

user1971414 asked Oct 06 '22 09:10


1 Answer

ContextIndex.similar_words(word) scores each candidate word by summing, over every context it shares with word, the product of the two words' frequencies in that context. Text.similar() simply counts the number of distinct contexts the two words share. The two measures differ, which is why the lists differ.
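
As a rough illustration of the difference, here is a minimal Python 3 sketch that does not use NLTK. It assumes a context is the (previous word, next word) pair around a token, as in the trace further down. The toy token list and the helper names build_index, product_scores and shared_context_counts are made up for this example and are not part of NLTK.

```python
from collections import Counter, defaultdict


def build_index(tokens):
    """Map each (left, right) context to a Counter of the words seen in it."""
    context_to_words = defaultdict(Counter)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        context_to_words[(left, right)][word] += 1
    return context_to_words


def product_scores(context_to_words, target):
    """similar_words()-style score: sum over contexts of freq(target) * freq(w)."""
    scores = Counter()
    for words in context_to_words.values():
        if target in words:
            for w, freq in words.items():
                if w != target:
                    scores[w] += words[target] * freq
    return scores


def shared_context_counts(context_to_words, target):
    """similar()-style score: number of distinct contexts shared with target."""
    counts = Counter()
    for words in context_to_words.values():
        if target in words:
            for w in words:
                if w != target:
                    counts[w] += 1
    return counts


# Hypothetical toy corpus, chosen so the two measures disagree.
tokens = ("a woman who a man who a man who a man who "
          "the woman said the girl said a girl who").split()
index = build_index(tokens)

print(product_scores(index, "woman").most_common())         # [('man', 3), ('girl', 2)]
print(shared_context_counts(index, "woman").most_common())  # [('girl', 2), ('man', 1)]
```

In this toy data 'man' shares only one context with 'woman' but is frequent in it, so the product score ranks it first, while 'girl' shares two contexts and wins on the shared-context count.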

similar_words() seems to contain a bug in NLTK 2.0. See the definition in nltk/text.py:

def similar_words(self, word, n=20):
    scores = defaultdict(int)
    for c in self._word_to_contexts[self._key(word)]:
        for w in self._context_to_words[c]:
            if w != word:
                print w, c, self._context_to_words[c][word], self._context_to_words[c][w]
                scores[w] += self._context_to_words[c][word] * self._context_to_words[c][w]
    return sorted(scores, key=scores.get)[:n]

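The returned word list should be sorted in descending order of similarity score. As written, sorted(scores, key=scores.get) sorts in ascending order, so the first n words are the least similar ones. Replace the return statement with: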

return sorted(scores, key=scores.get)[::-1][:n]
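
To see the effect of the fix in isolation, here is a small Python 3 sketch. The scores are made-up toy numbers, not values from the Brown corpus.

```python
# Hypothetical similarity scores, for illustration only.
scores = {"man": 30, "girl": 12, "boy": 7, "dealer": 2}
n = 2

# Original return statement: ascending order, so the least similar words come first.
buggy = sorted(scores, key=scores.get)[:n]

# Fixed return statement: reverse the ascending list, then take the first n.
fixed = sorted(scores, key=scores.get)[::-1][:n]

# An alternative spelling; it can differ from [::-1] only in the order of tied scores.
alternative = sorted(scores, key=scores.get, reverse=True)[:n]

print(buggy)        # ['dealer', 'boy']
print(fixed)        # ['man', 'girl']
print(alternative)  # ['man', 'girl']
```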

In similar(), the call to similar_words() is commented out, perhaps due to this bug.

def similar(self, word, num=20):
    if '_word_context_index' not in self.__dict__:
        print 'Building word-context index...'
        self._word_context_index = ContextIndex(self.tokens,
                                                filter=lambda x:x.isalpha(),
                                                key=lambda s:s.lower())

#   words = self._word_context_index.similar_words(word, num)

    word = word.lower()
    wci = self._word_context_index._word_to_contexts
    if word in wci.conditions():
        contexts = set(wci[word])
        fd = FreqDist(w for w in wci.conditions() for c in wci[w]
                      if c in contexts and not w == word)
        words = fd.keys()[:num]
        print tokenwrap(words)
    else:
        print "No matches"

Note: in NLTK 2.0, FreqDist.keys(), unlike dict.keys(), returns a list sorted by decreasing frequency, so fd.keys()[:num] is the num most frequent samples.
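
The sorted keys() is specific to the older FreqDist. In NLTK 3, FreqDist subclasses collections.Counter and keys() is no longer sorted; most_common() gives the frequency ordering instead. The following Python 3 sketch uses collections.Counter as a stand-in for FreqDist, with toy data:

```python
from collections import Counter

# Toy samples, for illustration only; each item stands for one shared context.
fd = Counter(["time", "day", "man", "man", "day", "man"])

# keys() follows insertion order, not frequency.
print(list(fd.keys()))                    # ['time', 'day', 'man']

# most_common(n) returns the n most frequent samples, highest count first.
print([w for w, _ in fd.most_common(2)])  # ['man', 'day']
```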

Example:

import nltk

text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')

similar_words = text._word_context_index.similar_words('woman')
print ' '.join(similar_words)

Output:

man day time year car moment world family house boy child country
job state girl place war way case question   # Text.similar()

#man ('a', 'who') 9 39   # output from similar_words(); see following explanation
#girl ('a', 'who') 9 6
#[...]

man number time world fact end year state house way day use part
kind boy matter problem result girl group   # ContextIndex.similar_words()

fd, the frequency distribution in similar(), tallies for each word the number of contexts it shares with 'woman':

fd = [('man', 52), ('day', 30), ('time', 30), ('year', 28), ('car', 24), ('moment', 24), ('world', 23) ...]

For each context in which 'woman' appears, similar_words() multiplies the frequency of 'woman' by the frequency of each other word in that context and adds the product to that word's score:

man ('a', 'who') 9 39  # 'a man who' occurs 39 times in text;
                       # 'a woman who' occurs 9 times
                       # Similarity score for the context is the product:
                       #     score['man'] = 9 * 39
girl ('a', 'who') 9 6
writer ('a', 'who') 9 4
boy ('a', 'who') 9 3
child ('a', 'who') 9 2
dealer ('a', 'who') 9 2
...
man ('a', 'and') 6 11  # score += 6 * 11
...
man ('a', 'he') 4 6    # score += 4 * 6
...
[49 more occurrences of 'man']
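
The running total for 'man' can be checked by hand from the three contexts shown in the trace. The Python 3 sketch below uses only those numbers; the remaining 49 contexts are elided above, so the result is a partial sum, not the final score. Note that 3 shown contexts plus 49 more gives 52, which matches the count for 'man' in fd.

```python
# (frequency of 'woman', frequency of 'man') per context, copied from the trace.
shown = {
    ("a", "who"): (9, 39),
    ("a", "and"): (6, 11),
    ("a", "he"): (4, 6),
}

# Sum of products over the contexts shown: 9*39 + 6*11 + 4*6
partial_score = sum(f_woman * f_man for f_woman, f_man in shown.values())
print(partial_score)  # 441
```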
Richard answered Oct 10 '22 01:10