Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

clustering list of words in python

I am a newbie in text mining, here is my situation. Suppose i have a list of words ['car', 'dog', 'puppy', 'vehicle'], i would like to cluster words into k groups, I want the output to be [['car', 'vehicle'], ['dog', 'puppy']]. I first calculate similarity score of each pairwise word to obtain a 4x4 matrix(in this case) M, where Mij is the similarity score of word i and j. After transforming the words into numeric data, i utilize different clustering library(such as sklearn) or implement it by myself to get the word clusters.

I want to know does this approach makes sense? Besides, how do I determine the value of k? More importantly, i know that there exist different clustering technique, i am thinking whether i should use k-means or k-medoids for word clustering?

like image 416
Kevin Lee Avatar asked Jan 31 '17 11:01

Kevin Lee


People also ask

How do I cluster a list of words?

The easiest way to build a group is by collecting synonyms for a particular word. The group of synonyms becomes a cluster which has one common meaning and for that one meaning, you have effectively learnt multiple words.

How do you cluster text data in Python?

First, the number of clusters must be specified and then this same number of 'centroids' are randomly allocated. The Euclidean distance is then measured between each data point and the centroids. The data point is then 'assigned' to the centroid which is closest.

Which clustering algorithm is best for text data?

Based on experimental results, we discuss the features of a document clustering problem with the nature of SI algorithms and conclude that PSO and GWO are better than the traditional K-means clustering algorithm and PSO is the best performing algorithm in terms of finding the optimal solution.


2 Answers

Following up the answer by Brian O'Donnell, once you've computed the semantic similarity with word2vec (or FastText or GLoVE, ...), you can then cluster the matrix using sklearn.clustering. I've found that for small matrices, spectral clustering gives the best results.

It's worth keeping in mind that the word vectors are often embedded on a high-dimensional sphere. K-means with a Euclidean distance matrix fails to capture this, and may lead to poor results for the similarity of words that aren't immediate neighbors.

like image 61
Hooked Avatar answered Sep 30 '22 08:09

Hooked


If you want to cluster words by their "semantic similarity" (i.e. likeness of their meaning) take a look at Word2Vec and GloVe. Gensim has an implementation for Word2Vec. This web page, "Word2Vec Tutorial", by Radim Rehurek gives a tutorial on using Word2Vec to determine similar words.

like image 32
Brian O'Donnell Avatar answered Sep 30 '22 07:09

Brian O'Donnell