Clustering a list of words in Python

I am new to text mining, and here is my situation. Suppose I have a list of words, ['car', 'dog', 'puppy', 'vehicle'], and I would like to cluster them into k groups so that the output is [['car', 'vehicle'], ['dog', 'puppy']]. My plan is to first compute a similarity score for every pair of words, giving a 4x4 matrix M (in this case), where Mij is the similarity score of word i and word j. With the words turned into numeric data this way, I would then use a clustering library (such as sklearn), or implement the algorithm myself, to get the word clusters.
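To make the matrix M concrete, here is a minimal sketch of the pairwise-similarity step. It assumes cosine similarity as the score, and the 3-dimensional `toy_vectors` are made-up values standing in for real word embeddings (neither is specified in the question).

```python
import math

# Hypothetical 3-d vectors standing in for real word embeddings.
# Real embeddings would come from a trained model and have many more dimensions.
toy_vectors = {
    "car":     [0.9, 0.1, 0.0],
    "dog":     [0.1, 0.9, 0.2],
    "puppy":   [0.0, 0.8, 0.3],
    "vehicle": [0.8, 0.2, 0.1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(words, vectors):
    """Return M where M[i][j] is the similarity of words[i] and words[j]."""
    return [[cosine_similarity(vectors[a], vectors[b]) for b in words]
            for a in words]

words = ["car", "dog", "puppy", "vehicle"]
M = similarity_matrix(words, toy_vectors)

# Print one row per word, rounded for readability.
for word, row in zip(words, M):
    print(f"{word:8s}", " ".join(f"{x:.2f}" for x in row))
```

With these toy values, 'car' scores much higher against 'vehicle' than against 'dog', which is the structure a clustering step would then pick up.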

Does this approach make sense? Also, how do I determine the value of k? More importantly, I know there are different clustering techniques: should I use k-means or k-medoids for word clustering?

asked Jan 31 '17 11:01

Kevin Lee

2 Answers

Following up on the answer by Brian O'Donnell: once you've computed the semantic similarity with word2vec (or FastText, GloVe, ...), you can cluster the resulting matrix using sklearn.cluster. I've found that for small matrices, spectral clustering gives the best results.
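As a dependency-free illustration of clustering a precomputed matrix, here is a rough sketch. Note that it is not spectral clustering and does not use scikit-learn: it is a brute-force k-medoids (one of the options raised in the question), chosen because k-medoids only needs pairwise distances. The distance values in `D` are made up (think 1 - similarity), and the helper names `k_medoids`, `group` and `mean_silhouette` are hypothetical. The silhouette score at the end is one common heuristic for picking k, not the only one.

```python
from itertools import combinations

words = ["car", "dog", "puppy", "vehicle"]

# Hypothetical pairwise distances (e.g. 1 - similarity); made-up values.
D = [
    [0.00, 0.80, 0.85, 0.05],
    [0.80, 0.00, 0.10, 0.75],
    [0.85, 0.10, 0.00, 0.80],
    [0.05, 0.75, 0.80, 0.00],
]

def k_medoids(D, k):
    """Exhaustive k-medoids: try every set of k medoids, keep the cheapest.

    Only feasible for small inputs, since the number of candidate sets
    grows combinatorially.
    """
    n = len(D)
    best_cost, best_medoids = None, None
    for medoids in combinations(range(n), k):
        # Cost = sum of each item's distance to its nearest medoid.
        cost = sum(min(D[i][m] for m in medoids) for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # Label each item with the index of its nearest medoid.
    labels = [min(range(k), key=lambda c: D[i][best_medoids[c]])
              for i in range(n)]
    return labels, best_medoids

def group(words, labels, k):
    """Turn a flat label list into a list of word clusters."""
    clusters = [[] for _ in range(k)]
    for word, label in zip(words, labels):
        clusters[label].append(word)
    return clusters

def mean_silhouette(D, labels):
    """Mean silhouette score; singleton clusters contribute 0."""
    n = len(D)
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue  # singleton cluster: score taken as 0
        a = sum(D[i][j] for j in own) / len(own)
        b = min(
            sum(D[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / n

labels, medoids = k_medoids(D, k=2)
clusters = group(words, labels, 2)
print(clusters)  # [['car', 'vehicle'], ['dog', 'puppy']]

# Try each candidate k (at least 2, fewer than the number of words)
# and keep the one with the highest mean silhouette.
best_k = max(range(2, len(words)),
             key=lambda k: mean_silhouette(D, k_medoids(D, k)[0]))
print(best_k)  # 2
```

For a real vocabulary the exhaustive search is far too slow, so a library implementation would be the practical choice; the sketch is only meant to show that a pairwise matrix is enough input for this family of methods.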

It's worth keeping in mind that word vectors are often embedded on a high-dimensional sphere. K-means, which relies on Euclidean distance, fails to capture this and may give poor results for the similarity of words that aren't immediate neighbors.
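One commonly used workaround is to scale every vector to unit length before clustering: for unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so Euclidean distance then ranks pairs the same way cosine similarity does. A small check of that identity, using two arbitrary example vectors (not real embeddings):

```python
import math

def normalize(v):
    """Scale v to unit length so it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Arbitrary example vectors with different lengths.
u = [3.0, 4.0, 0.0]
v = [1.0, 2.0, 2.0]

u_hat, v_hat = normalize(u), normalize(v)

lhs = squared_euclidean(u_hat, v_hat)   # distance after normalization
rhs = 2 - 2 * cosine_similarity(u, v)   # derived from cosine similarity

print(math.isclose(lhs, rhs))  # True
```

This does not make k-means equivalent to spectral clustering; it only removes the effect of vector length from the distance.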

answered Sep 30 '22 08:09

Hooked

If you want to cluster words by their "semantic similarity" (i.e. likeness of meaning), take a look at Word2Vec and GloVe. Gensim has an implementation of Word2Vec. The "Word2Vec Tutorial" web page by Radim Rehurek shows how to use Word2Vec to find similar words.

answered Sep 30 '22 07:09

Brian O'Donnell

Tags: python, nlp, cluster-analysis, text-mining