 

How to calculate the similarity of English words that do not appear in WordNet?

A common natural language processing practice is to compute the similarity between two words using WordNet. I'll start my question with the following Python code:

from nltk.corpus import wordnet
sport = wordnet.synsets("sport")[0]
badminton = wordnet.synsets("badminton")[0]
print(sport.wup_similarity(badminton))

We will get 0.8421

Now what if I look up "haha" and "lol" as follows:

haha = wordnet.synsets("haha")
lol = wordnet.synsets("lol")
print(haha)
print(lol)

We will get

[]
[]

Both lookups return empty lists, so we cannot compute a similarity between them. What can we do in this case?

asked Jul 08 '16 by Duong Trung Nghia

2 Answers

You can build a semantic space from co-occurrence matrices using a tool like Dissect (DIStributional SEmantics Composition Toolkit), and then you are set to measure semantic similarity between words or phrases (if you compose words).

In your case, for "haha" and "lol", you'll first need to collect those co-occurrences from a corpus that actually contains them.

Another thing to try is word2vec.
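The distributional idea above can be sketched without any external toolkit: count each word's context words in a window, then compare the resulting count vectors with cosine similarity. The toy corpus, window size, and helper names below are illustrative assumptions, not Dissect's actual API.

```python
# Toy distributional-similarity sketch: words that appear in
# similar contexts get similar co-occurrence vectors.
from collections import Counter
from math import sqrt

# Illustrative toy corpus (assumption): "haha" and "lol" share contexts.
corpus = [
    "haha that joke was funny",
    "lol that joke was funny",
    "haha so funny",
    "lol so funny",
    "rain is wet today",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within +/- window."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            ctx = vectors.setdefault(word, Counter())
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["haha"], vecs["lol"]))   # high: shared contexts
print(cosine(vecs["haha"], vecs["rain"]))  # low: no shared contexts
```

On a real corpus you would typically also weight the counts (e.g. with PPMI) and reduce dimensionality, which is essentially what Dissect automates.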

answered Oct 13 '22 by sarnthil

There are two word2vec model architectures worth knowing:

CBOW (continuous bag-of-words): predicts a target word from its surrounding context words.

Skip-gram: the inverse of CBOW: it predicts the context words given the target word.

Look at this: https://www.quora.com/What-are-the-continuous-bag-of-words-and-skip-gram-architectures-in-laymans-terms

These models are well presented here: https://www.tensorflow.org/tutorials/word2vec. Also, gensim is a good Python library for this kind of task.


Also look at TensorFlow's word2vec example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py

For background on word2vec, see: https://en.wikipedia.org/wiki/Word2vec

answered Oct 13 '22 by Masoud