This is a homework question. I have a huge document full of words, and my challenge is to classify these words into groups/clusters that adequately represent them. My strategy is to use the K-Means algorithm, which as you know takes the following steps:

1. Pick k initial means at random.
2. Associate each word with the nearest mean.
3. Recompute the centroid of each cluster and use it as the new mean.
4. Repeat steps 2 and 3 until the clusters stop changing.
Theoretically I kind of get it, but not quite. I have a question corresponding to each step:
How do I decide on the k initial means? Technically I could just say 5, but that may not be a good choice. Is k purely arbitrary, or is it driven by heuristics such as the size of the dataset, the number of words involved, etc.?
How do you associate each word with the nearest mean? Theoretically each word is assigned by its distance to the nearest mean, so with 3 means, the cluster a word belongs to depends on which mean it is closest to. But how is this actually computed? Given two words, "group" and "textword", and a mean word "pencil", how do I build a distance/similarity matrix? (A sketch of what I mean is below, after the questions.)
How do you calculate the centroid?
When you repeat steps 2 and 3, do you treat each previous cluster as a new data set?
Lots of questions, and I am obviously not clear on this. If there are any resources I can read, that would be great; Wikipedia did not suffice. :(
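To make my questions concrete, here is a minimal sketch of what I think the computation looks like, assuming the words are first turned into numeric vectors with scikit-learn's character n-gram TF-IDF (that representation, the three sample words, and treating "pencil" as a mean are all just placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

words = ["group", "textword", "pencil"]

# Turn each word into a numeric vector (character n-gram TF-IDF is just one
# possible representation; word embeddings would work too).
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(words).toarray()

# Question 2: "nearest mean" just means computing the distance from every word
# vector to every mean vector and taking the smallest one.
means = X[[2]]                               # pretend "pencil" is the only current mean
dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
print(dists)                                 # word-to-mean distance matrix
print(dists.argmin(axis=1))                  # index of the closest mean for each word

# Question 3: a centroid is just the element-wise average of the vectors
# assigned to a cluster.
centroid = X[dists.argmin(axis=1) == 0].mean(axis=0)

# Question 1: k is usually not purely random; a common heuristic (the "elbow
# method") runs K-Means for several k and watches how the inertia drops.
for k in range(1, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```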
Word level: Word clusters are groups of words based on a common theme. The easiest way to build a cluster is by collecting synonyms for a particular word. For example, WordNet is a lexical database for the English language that groups English words into sets of synonyms called synsets.
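For instance, a minimal sketch using NLTK's WordNet interface shows the synsets for a word (assuming NLTK and its WordNet data are installed; the word "pencil" is just an example):

```python
# Requires `nltk.download('wordnet')` the first time it is run.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("pencil"):
    # Each synset is a group of words (lemmas) sharing one sense.
    print(synset.name(), synset.lemma_names())
```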
Google's search engine is probably the best and most widely known example. When you search for a term on Google, it pulls up pages that apply to that term, but have you ever wondered how Google can analyze billions of web pages to deliver an accurate and fast result? It's because of text clustering!
Clustering is the task of dividing data points into groups such that points in the same group are more similar to each other than to points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them to clusters.
Definition: text clustering is the automatic grouping of textual documents (for example, plain-text documents, web pages, emails, etc.) into clusters based on their content similarity.
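As a rough sketch of what that looks like in practice (the sample documents and the choice of two clusters below are made up purely for illustration), documents can be turned into TF-IDF vectors and grouped with K-Means:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about the market",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents with similar content should share a label
```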
As you don't know the exact number of clusters in advance, I'd suggest a form of hierarchical (agglomerative) clustering: start with each word in its own cluster and repeatedly merge the two closest clusters until a chosen distance threshold, or a desired number of clusters, is reached.
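A minimal sketch of that idea with SciPy, assuming the words have already been turned into vectors (character n-gram TF-IDF is used here only as a placeholder representation, and the 0.5 distance cut-off is arbitrary):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_extraction.text import TfidfVectorizer

words = ["group", "grouping", "textword", "pencil", "pen"]
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(words).toarray()

# Build the merge tree (every word starts as its own cluster), then cut it
# at a distance threshold to obtain the final clusters.
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=0.5, criterion="distance")
print(dict(zip(words, labels)))
```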
P.S. You can find many papers on the web describing clustering based on building a minimum spanning tree.
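A rough sketch of that MST idea with SciPy (the word list, the cosine metric, and the choice of two clusters are all placeholder assumptions):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import TfidfVectorizer

words = ["group", "grouping", "textword", "pencil", "pen"]
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(words).toarray()

# Build a minimum spanning tree over the pairwise distances, drop the k-1
# heaviest edges, and the remaining connected components are the clusters.
dist = squareform(pdist(X, metric="cosine"))
mst = minimum_spanning_tree(dist).toarray()

k = 2                                            # desired number of clusters
heaviest = np.argsort(mst, axis=None)[::-1][: k - 1]
mst[np.unravel_index(heaviest, mst.shape)] = 0   # cut the k-1 heaviest edges

_, labels = connected_components(mst, directed=False)
print(dict(zip(words, labels)))
```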
P.P.S. If you want to detect clusters of semantically similar words, you need algorithms for automatic thesaurus construction.