
What is the relation between topic modeling and document clustering?

Topic modeling identifies the distribution of topics in a document collection, which effectively identifies the clusters in that collection. So is it right to say that topic modeling is a technique for document clustering?

asked Mar 19 '13 02:03 by afs



1 Answer

A topic is quite different from a cluster of docs; after all, a topic is not composed of docs.

However, the two techniques are indeed related. I believe topic modeling is a viable way of deciding how similar documents are, and hence a viable basis for document clustering.

By representing each document as a topic distribution (in effect, a vector), topic modeling techniques reduce the feature dimensionality from the number of distinct words that appear in the corpus to the number of topics. The similarity between documents' topic distributions can be calculated using cosine similarity or many other metrics, and it reflects the similarity of the documents themselves in terms of the topics/themes they cover. Based on this quantified similarity measure, many clustering algorithms can be applied to group the documents.
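As a rough illustration of the idea above, here is a minimal Python sketch. It assumes the per-document topic distributions have already been produced by some topic model; the document names, the numbers in `doc_topics`, the helper `cluster_by_similarity`, and the 0.9 threshold are all made up for the example. The greedy grouping stands in for whatever clustering algorithm you would actually use.

```python
import math

# Hypothetical document-topic distributions: one vector per document,
# one entry per topic. In practice these would come from a fitted topic
# model; the values here are invented for illustration.
doc_topics = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.70, 0.20, 0.10],
    "doc_c": [0.10, 0.85, 0.05],
    "doc_d": [0.05, 0.90, 0.05],
    "doc_e": [0.10, 0.10, 0.80],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_by_similarity(vectors, threshold=0.9):
    """Greedy single-pass clustering (a simple stand-in, not a standard
    library routine): each document joins the first cluster whose first
    member is at least `threshold`-similar, else it starts a new cluster."""
    clusters = []
    for name, vec in vectors.items():
        for cluster in clusters:
            seed = vectors[cluster[0]]
            if cosine_similarity(vec, seed) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([name])
    return clusters


clusters = cluster_by_similarity(doc_topics)
print(clusters)
```

Documents dominated by the same topic end up grouped together, even though no word-level features are compared directly. Any other similarity measure or clustering algorithm could be swapped in on top of the same topic vectors.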

In this sense, I think it is right to say that topic modeling is a technique for document clustering.

answered Sep 26 '22 02:09 by Shockley