GSDMM Convergence of Clusters (Short Text Clustering)

Tags:

I am using this GSDMM python implementation to cluster a dataset of text messages. GSDMM converges fast (around 5 iterations) according the inital paper. I also have a convergence to a certain number of clusters, but there are still a lot of messages transferred in each iteration, so a lot of messages are still changing their cluster.

My output looks like:

In stage 0: transferred 9511 clusters with 150 clusters populated 
In stage 1: transferred 4974 clusters with 138 clusters populated 
In stage 2: transferred 2533 clusters with 90 clusters populated
….
In stage 34: transferred 1403 clusters with 47 clusters populated 
In stage 35: transferred 1410 clusters with 47 clusters populated 
In stage 36: transferred 1430 clusters with 48 clusters populated 
In stage 37: transferred 1463 clusters with 48 clusters populated 
In stage 38: transferred 1359 clusters with 48 clusters populated

In the initial paper figure 3 shows the same pattern, the number of clusters in nearly constant.

graph from the paper

What I can't figure out is how many messages of their dataset where still transfering. My understanding is, that this number should be as small as possible, in best case zero (so every message "found" the right cluster). So the number of clusters might be converging, but that doens´t say much about the quality of the algorithm/clusters. Is my understanding correct?

It also is a possibility that my data is not good enough to get proper clustering.

955

asked Jun 04 '20 09:06

simon

1 Answers

After a deeper dive into the functionality of the GSDMM Algorithm, I can share some new information.

Here is some background information about the algorithm, of course this is not a complete description of how the algorithm works:

• GSDMM is a soft clustering algorithm

• Underlaying the allocations of inputs (e.g. messages) to clusters are distributions (Multinomial distributions with Dirichlet Distributions as their Prior)

• The “Score”-Metric, that shows probability of an input belonging to a cluster, is based on a multinomial distribution and over all clusters adds up to 1

So as long as you don´t have very clear and easy separable clusters there will be inputs that “belong” to several clusters with a significant probability, e.g. Message 1 has a score value of 0.5 for Cluster 1, a score value of 0.4 for Cluster 2 and 0.1 for all the other clusters combined. If there are inputs with score values like that, due to the assignment depending on the multinomial distribution they sometimes will jump for one cluster to another one.

With knowing that I would say it is normal, even after a lot of iterations, to have jumping inputs. To measure the quality of your clustering, you should assign the inputs to the cluster with their highest score value and should not take the clustering based on the last iteration of your training.

Another option would be to leave out inputs that jump a lot or don´t have a cluster with a superior value, because these inputs are not good to fit in clusters (maybe some bad data, of course depending from case to case)

177

answered Sep 19 '22 11:09

simon

Related questions
                            
                                Object of type Response is not JSON serializable
                            
                                Correct POS tags for numbers substituted with ## in spacy
                            
                                Is zip_safe only relevant for the egg format?
                            
                                How to use the s3 hook in airflow
                            
                                How to use timedelta with pandas df.query()?
                            
                                Liquibase integration in python project
                            
                                Python Plotly: How to add an image to a 3D scatter plot
                            
                                Docker hyperkit process CPU usage going crazy. How to keep it under control?
                            
                                Dash: how to control graph style via CSS?
                            
                                Python: How to offer a single executable file without showing the code in 2020
                            
                                Cannot open jupyter notebook in VSCode
                            
                                How to find which library prevents updating a package in conda?
                            
                                Problem with KerasRegressor & multiple output
                            
                                Tensorflow graph nodes are exchange
                            
                                Error "Running as root without --no-sandbox is not supported"
                            
                                Fastparquet giving "TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO" while using dataframe.to_parquet()
                            
                                Change alembic logger
                            
                                Unable to Instantiate Python Dataclass (Frozen) inside a Pytest function that uses Fixtures
                            
                                How to deal with name clash collections.Counter and typing.Counter?
                            
                                Ubuntu 20.04 "Temporary failure in name resolution" - recently reinstalled

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

GSDMM Convergence of Clusters (Short Text Clustering)

Tags:

python

cluster-analysis

topic-modeling

convergence

simon

People also ask

1 Answers

simon

Recent Activity

Donate For Us