I am using this GSDMM python implementation to cluster a dataset of text messages. According to the initial paper, GSDMM converges fast (in around 5 iterations). I also see convergence to a certain number of clusters, but many messages are still transferred in each iteration, i.e. a lot of messages are still changing their cluster.
My output looks like:
In stage 0: transferred 9511 clusters with 150 clusters populated
In stage 1: transferred 4974 clusters with 138 clusters populated
In stage 2: transferred 2533 clusters with 90 clusters populated
….
In stage 34: transferred 1403 clusters with 47 clusters populated
In stage 35: transferred 1410 clusters with 47 clusters populated
In stage 36: transferred 1430 clusters with 48 clusters populated
In stage 37: transferred 1463 clusters with 48 clusters populated
In stage 38: transferred 1359 clusters with 48 clusters populated
In the initial paper, figure 3 shows the same pattern: the number of clusters is nearly constant.
What I can't figure out is how many messages of their dataset were still transferring. My understanding is that this number should be as small as possible, in the best case zero (so every message has "found" the right cluster). So the number of clusters might be converging, but that doesn't say much about the quality of the algorithm/clusters. Is my understanding correct?
It is also possible that my data is simply not good enough for proper clustering.
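For context, here is roughly how the output above is produced (a minimal sketch, assuming the widely used rwalk/gsdmm package; `docs` and the hyperparameters are placeholders for my actual data and settings):

```python
from gsdmm import MovieGroupProcess  # https://github.com/rwalk/gsdmm

# docs: list of tokenized messages, e.g. [["free", "gift"], ["meeting", "at", "noon"], ...]
docs = [["free", "gift", "click"], ["meeting", "at", "noon"], ["win", "a", "prize"]]
vocab = set(word for doc in docs for word in doc)

# K is an upper bound on the number of clusters; alpha and beta are the Dirichlet priors.
mgp = MovieGroupProcess(K=150, alpha=0.1, beta=0.1, n_iters=40)

# fit() prints the "In stage i: transferred ... clusters populated" lines shown above
# and returns one cluster label per document from the last iteration.
labels = mgp.fit(docs, len(vocab))
```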
GSDMM is essentially a modified LDA (Latent Dirichlet Allocation) that assumes a document (a tweet or text message, for instance) contains exactly one topic. This differs from LDA, which assumes that a document can cover multiple topics.
Short text clustering is a challenging task due to the lack of signal contained in short texts. In this work, we propose iterative classification as a method to boost the clustering quality of short texts. The idea is to repeatedly reassign (classify) outliers to clusters until the cluster assignment stabilizes.
The most popular algorithms for clustering are K-means and its variants such as bisecting K-means and K-medoids [4]. The K-means algorithm is a simple, fast, and unsupervised partitioning algorithm offering easily parallelized and intuitive results [5].
After a deeper dive into the functionality of the GSDMM algorithm, I can share some new information.
Here is some background information about the algorithm (of course, this is not a complete description of how it works):
• GSDMM is a soft clustering algorithm
• Underlying the allocation of inputs (e.g. messages) to clusters are probability distributions (multinomial distributions with Dirichlet distributions as their priors)
• The "score" metric, which gives the probability of an input belonging to a cluster, is based on a multinomial distribution and adds up to 1 over all clusters (see the sketch after this list)
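To make the last point concrete, here is a small sketch of inspecting that score vector (again assuming the rwalk/gsdmm API, with a fitted `mgp` and token lists `docs` as in the sketch in the question):

```python
# score() returns the probability of the document belonging to each of the K clusters.
probs = mgp.score(docs[0])

print(len(probs))             # K entries, one per cluster
print(round(sum(probs), 6))   # the scores sum to (approximately) 1
print(max(probs))             # probability of the most likely cluster
```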
So as long as you don't have very clear and easily separable clusters, there will be inputs that "belong" to several clusters with significant probability, e.g. Message 1 has a score of 0.5 for Cluster 1, a score of 0.4 for Cluster 2, and 0.1 for all other clusters combined. Because the assignment is sampled from this multinomial distribution, such inputs will sometimes jump from one cluster to another.
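To illustrate why such a message keeps jumping, here is a small standalone simulation (plain NumPy, not part of GSDMM itself) that samples a cluster assignment from the hypothetical score vector [0.5, 0.4, 0.1] across many iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = [0.5, 0.4, 0.1]  # score vector of the hypothetical Message 1

# Sample one assignment per iteration, as a Gibbs sampler effectively does.
assignments = rng.choice(len(scores), size=20, p=scores)
print(assignments)  # e.g. [0 1 0 0 2 1 ...] -- the message keeps changing cluster

jumps = np.sum(assignments[1:] != assignments[:-1])
print(f"{jumps} cluster changes in {len(assignments) - 1} transitions")
```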
Knowing that, I would say it is normal to have jumping inputs even after many iterations. To measure the quality of your clustering, you should assign each input to the cluster with its highest score value, rather than taking the assignment from the last iteration of your training.
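In the rwalk/gsdmm implementation this final hard assignment can be done with choose_best_label, which returns the most probable cluster and its score (a sketch; `mgp` and `docs` as above):

```python
# Assign each message to its most probable cluster instead of the
# (possibly jumping) label from the last Gibbs iteration.
best = [mgp.choose_best_label(doc) for doc in docs]  # list of (label, probability)

for doc, (label, prob) in zip(docs, best):
    print(label, round(prob, 3), " ".join(doc))
```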
Another option is to leave out inputs that jump a lot or have no cluster with a clearly dominant score, because these inputs do not fit well into any cluster (perhaps bad data; this of course depends on the case).
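A minimal sketch of that filtering, using a hypothetical confidence threshold of 0.5 (a value you would tune for your own data; `docs` and `best` as above):

```python
THRESHOLD = 0.5  # hypothetical cutoff; tune per case

confident = [(doc, label) for doc, (label, prob) in zip(docs, best)
             if prob >= THRESHOLD]
discarded = len(docs) - len(confident)
print(f"kept {len(confident)} messages, discarded {discarded} low-confidence ones")
```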