NLTK: Most common synonym (WordNet) for each word

Is there a way to find the most common synonym of a word with NLTK? I would like to simplify a sentence using the most common synonyms of each word in it.

If a word used in the sentence is already the most common word from its group of synonyms, it shouldn't be changed.

Let's say "Hi" is more common than "Hello"; "Dear" is more common than "Valued"; and "Friend" is already the most common word of its group of synonyms.

Input: "Hello my valued friend"
Return: "Hi my dear friend"
Asked Jul 06 '16 by 42piratas

2 Answers

Synonyms are tricky, but if you are starting out with a synset from WordNet and you simply want to choose the most common member of the set, it's pretty straightforward: just build your own frequency list from a corpus, and look up each member of the synset to pick the maximum.

NLTK will let you build a frequency table in just a few lines of code. Here's one based on the Brown corpus:

import nltk
from nltk.corpus import brown

freqs = nltk.FreqDist(w.lower() for w in brown.words())

You can then look up the frequency of a word like this:

>>> print(freqs["valued"]) 
14

Of course you'll need to do a little more work: I would count words separately for each of the major parts of speech (WordNet provides n, v, a, and r, for noun, verb, adjective and adverb, respectively), and use these POS-specific frequencies (after adjusting for the different tagset notations) to choose the right substitute.

>>> freq2 = nltk.ConditionalFreqDist((tag, wrd.lower()) for wrd, tag in
...         brown.tagged_words(tagset="universal"))

>>> print(freq2["ADJ"]["valued"])
0
>>> print(freq2["ADJ"]["dear"])
45
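
Putting these pieces together, a minimal sketch of the substitution step might look like this. The most_common_synonym helper is illustrative (it's not a standard NLTK function), and pooling lemmas from every synset of the word skips word-sense disambiguation entirely, so treat it as a rough heuristic:

import nltk
from nltk.corpus import brown, wordnet as wn

# POS-specific frequencies from the Brown corpus, as above.
freq2 = nltk.ConditionalFreqDist((tag, wrd.lower()) for wrd, tag in
                                 brown.tagged_words(tagset="universal"))

def most_common_synonym(word, wn_pos=wn.ADJ, brown_tag="ADJ"):
    word = word.lower()
    best, best_freq = word, freq2[brown_tag][word]
    for synset in wn.synsets(word, pos=wn_pos):
        for lemma in synset.lemmas():
            candidate = lemma.name().replace("_", " ").lower()
            # Only switch when a synonym is strictly more frequent, so a
            # word that is already the most common in its group is kept.
            if freq2[brown_tag][candidate] > best_freq:
                best, best_freq = candidate, freq2[brown_tag][candidate]
    return best

print(most_common_synonym("valued"))  # falls back to "valued" if WordNet
                                      # offers no more frequent synonym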
Answered by alexis


Synonyms are a huge and open area of work in natural language processing.

In your example, how is the program supposed to know what the allowed synonyms are? One method might be to keep a dictionary of sets of synonyms for each word. However, this can run into problems due to overlaps in parts of speech: "dear" is an adjective, but "valued" can be an adjective or a past-tense verb.

Context is also important: the bigram "dear friend" might be more common than "valued friend", but "valued customer" would be more common than "dear customer". So, the sense of a given word needs to be accounted for too.
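
To make that point concrete, here's a quick sketch of comparing bigram counts, again using the Brown corpus (a small corpus, so the counts are only suggestive):

import nltk
from nltk.corpus import brown

# Condition on the first word of each bigram, count the second.
words = [w.lower() for w in brown.words()]
bigram_freqs = nltk.ConditionalFreqDist(nltk.bigrams(words))

for adj in ("dear", "valued"):
    for noun in ("friend", "customer"):
        print(adj, noun, bigram_freqs[adj][noun])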

Another method might be to just look at everything and see which words appear in similar contexts. You need a huge corpus for this to be effective, though, and you have to decide how large an n-gram window you want to use (a bigram context? a 20-gram context?).

I recommend you take a look at applications of WordNet (https://wordnet.princeton.edu/), which was designed to help figure some of these things out. Unfortunately, I'm not sure you'll find a way to "solve" synonyms on your own, but keep looking and asking questions!

Edit: I should have included this link to an older question as well:

How to get synonyms from nltk WordNet Python

And the NLTK documentation on its interface with WordNet:

http://www.nltk.org/howto/wordnet.html

I don't think these address your question, however, since WordNet doesn't have usage statistics (which are dependent on the corpus you use). You should be able to apply its synsets in a method like above, though.
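
For completeness, here's a minimal sketch of pulling synonym candidates out of WordNet via NLTK (see the howto linked above). Each synset is one sense of the word, and picking the right sense is the hard part, which this leaves open:

from nltk.corpus import wordnet as wn

# Each synset is one sense; its lemmas are the synonyms for that sense.
for synset in wn.synsets("dear", pos=wn.ADJ):
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])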

Answered by Clay