Big picture goal: I am making an LDA model of product reviews in Python using NLTK and Gensim. I want to run this on varying n-grams.
Problem: Everything is great with unigrams, but when I run with bigrams, I start to get topics with repeated information. For example, Topic 1 might contain ['good product', 'good value'], and Topic 4 might contain ['great product', 'great value']. To a human these obviously convey the same information, but 'good product' and 'great product' are distinct bigrams. How do I algorithmically determine that 'good product' and 'great product' are similar enough that I can translate all occurrences of one of them to the other (maybe the one that appears more often in the corpus)?
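For concreteness, here is a minimal sketch of the replacement step I have in mind, assuming I already had a way of deciding which bigrams are "similar enough" (the similar_groups input is exactly the part I don't know how to compute, which is what this question is about):

from collections import Counter

def normalize_bigrams(docs, similar_groups):
    # docs: list of token lists where each bigram is already a single token like 'good product'
    # similar_groups: iterable of sets of bigrams judged similar enough (hypothetical input)
    counts = Counter(tok for doc in docs for tok in doc)
    canon = {}
    for group in similar_groups:
        best = max(group, key=lambda b: counts[b])  # keep the most frequent variant
        for b in group:
            canon[b] = best
    return [[canon.get(tok, tok) for tok in doc] for doc in docs]

# e.g. normalize_bigrams(docs, [{'good product', 'great product'},
#                               {'good value', 'great value'}])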
What I've tried: I played around with WordNet's Synset tree, with little luck. It turns out that good is an 'adjective' but great is an 'adjective satellite', and path similarity therefore returns None. My thought process was to use WordNet similarity between the words of two bigrams to decide whether the bigrams should be collapsed into one.
Ideally, though, I'd like an algorithm that can determine that good and great are similar in my corpus (perhaps in a co-occurring sense), so that it can be extended to words that aren't part of the general English language but appear in my corpus, and so that it can be extended to n-grams (maybe Oracle and terrible are synonymous in my corpus, or feature engineering and feature creation are similar).
Any suggestions on algorithms, or suggestions for getting WordNet synsets to behave?
If you're going to use WordNet, you have two problems.
Problem 1: Word Sense Disambiguation (WSD), i.e. how do you automatically determine which synset to use?
>>> from nltk.corpus import wordnet as wn
>>> for i in wn.synsets('good','a'):
... print i.name, i.definition
...
good.a.01 having desirable or positive qualities especially those suitable for a thing specified
full.s.06 having the normally expected amount
good.a.03 morally admirable
estimable.s.02 deserving of esteem and respect
beneficial.s.01 promoting or enhancing well-being
good.s.06 agreeable or pleasing
good.s.07 of moral excellence
adept.s.01 having or showing knowledge and skill and aptitude
good.s.09 thorough
dear.s.02 with or in a close or intimate relationship
dependable.s.04 financially sound
good.s.12 most suitable or right for a particular purpose
good.s.13 resulting favorably
effective.s.04 exerting force or influence
good.s.15 capable of pleasing
good.s.16 appealing to the mind
good.s.17 in excellent physical condition
good.s.18 tending to promote physical well-being; beneficial to health
good.s.19 not forged
good.s.20 not left to spoil
good.s.21 generally admired
>>> for i in wn.synsets('great','a'):
... print i.name, i.definition
...
great.s.01 relatively large in size or number or extent; larger than others of its kind
great.s.02 of major significance or importance
great.s.03 remarkable or out of the ordinary in degree or magnitude or effect
bang-up.s.01 very good
capital.s.03 uppercase
big.s.13 in an advanced stage of pregnancy
Let's say you somehow get the correct sense, maybe by trying something like pywsd (https://github.com/alvations/pywsd), and let's say you get the POS and synset right:
good.a.01 having desirable or positive qualities especially those suitable for a thing specified
great.s.01 relatively large in size or number or extent; larger than others of its kind
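As an aside on Problem 1: NLTK itself ships a simple Lesk implementation in nltk.wsd if you want something quick and dirty. A rough sketch (the example sentence is made up, and there is no guarantee it picks the sense you want on short review text):

from nltk.wsd import lesk

sent = 'this is a good product for the price'.split()  # made-up example sentence
sense = lesk(sent, 'good', 'a')  # returns a Synset guess, or None if nothing matches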
Problem 2: How are you going to compare the two synsets?
Let's try the similarity functions, but you'll realize that they give you no score:
>>> from nltk.corpus import wordnet_ic
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')
>>> good = wn.synsets('good','a')[0]
>>> great = wn.synsets('great','a')[0]
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None
>>> print max(wn.wup_similarity(good,great), wn.wup_similarity(great, good))
>>> print max(wn.res_similarity(good,great,semcor_ic), wn.res_similarity(great, good,semcor_ic))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1312, in res_similarity
return synset1.res_similarity(synset2, ic, verbose)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 738, in res_similarity
ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
(synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.jcn_similarity(good,great,semcor_ic), wn.jcn_similarity(great, good,semcor_ic))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1316, in jcn_similarity
return synset1.jcn_similarity(synset2, ic, verbose)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 759, in jcn_similarity
ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
(synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
return synset1.lin_similarity(synset2, ic, verbose)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
(synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lch_similarity(good,great), wn.lch_similarity(great, good))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1304, in lch_similarity
return synset1.lch_similarity(synset2, verbose, simulate_root)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 638, in lch_similarity
(self, other))
nltk.corpus.reader.wordnet.WordNetError: Computing the lch similarity requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
Let's try a different pair of synsets. Since good has both satellite-adjective and adjective senses while great only has satellite senses, let's go with the lowest common denominator:
good.s.13 resulting favorably
great.s.01 relatively large in size or number or extent; larger than others of its kind
You realize that there is still no similarity information for comparing satellite adjectives:
>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
return synset1.lin_similarity(synset2, ic, verbose)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1645, in _lcs_ic
ic1 = information_content(synset1, ic)
File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1666, in information_content
raise WordNetError(msg % synset.pos)
nltk.corpus.reader.wordnet.WordNetError: Information content file has no entries for part-of-speech: s
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None
By now it seems like WordNet is creating more problems than it solves here, so let's try another means: word clustering, see http://en.wikipedia.org/wiki/Word-sense_induction
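If you do go down that corpus-driven road, here is a rough sketch in the same spirit (not clustering itself, but the distributional similarity such clustering would be built on): train word vectors on your own reviews with Gensim and treat words as interchangeable when their cosine similarity clears a threshold. The tokenized_reviews variable and the threshold value below are made-up placeholders, not anything the OP has:

from gensim.models import Word2Vec

# tokenized_reviews: a list of token lists built from your own review corpus (placeholder name)
model = Word2Vec(sentences=tokenized_reviews, vector_size=100,  # parameter is 'size' in gensim < 4.0
                 window=5, min_count=5, workers=4)

# corpus-specific similarity, no WordNet involved
print(model.wv.similarity('good', 'great'))
print(model.wv.most_similar('good', topn=5))

# a crude merge rule: treat two words as interchangeable above some threshold
THRESHOLD = 0.7  # made-up value, needs tuning on your own data
def close_enough(w1, w2):
    return (w1 in model.wv and w2 in model.wv
            and model.wv.similarity(w1, w2) >= THRESHOLD)

Because the vectors come from your own reviews, this also covers corpus-specific pairs like the OP's Oracle/terrible or feature engineering/feature creation examples, provided the phrases are tokenized as single units.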
This is also where I give up on answering the broad, open-ended question the OP has posted, because there's a LOT of work in clustering that remains automagic to mere mortals like me =)