tf-idf: am I understanding it right?

Question

I am interested in doing some document clustering, and right now I am considering using TF-IDF for this.

If I am not mistaken, TF-IDF is mainly used to evaluate the relevance of a document to a given query. If I do not have a particular query, how can I apply TF-IDF to clustering?

Kapil D · Accepted Answer

For document clustering, a good approach is to use the k-means algorithm. If you know how many types of documents you have, you know what k is.

To make it work on documents:

a) Choose k documents at random as the initial cluster centers.

b) Assign each document to the cluster whose center is at the minimum distance from it.

c) Once all documents are assigned, make k new cluster centers by taking the centroid of each cluster.

d) Repeat b) and c) until the assignments stop changing.
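The steps above can be sketched roughly as follows. This is only a minimal illustration in plain Python, assuming each document has already been turned into a dense list of TF-IDF weights of the same length. The names `kmeans` and `cosine_distance`, the `max_iter` cap, the `seed` parameter, and the choice to keep an old center when a cluster ends up empty are all my own assumptions, not part of the answer.

```python
import math
import random

def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def kmeans(vectors, k, max_iter=100, seed=0):
    """Cluster equal-length weight vectors; returns (labels, centers)."""
    rng = random.Random(seed)
    # a) pick k documents at random as the initial centers
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # b) assign each document to the closest center
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # c) replace each center by the centroid of its cluster
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # assumption: keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

For example, `kmeans([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], k=2)` should put the first two vectors in one cluster and the last two in the other. Which cluster gets which label depends on the random start.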

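Now, the questions are:

a) How to calculate the distance between two documents: it is based on the cosine similarity between the document's term vector and the cluster center's term vector (one common choice is 1 minus the cosine similarity, so more similar means closer). The term weights here are the TF-IDF values, calculated earlier for each document.

b) How to compute the centroid: for a given term, sum its TF-IDF values over the documents in the cluster and divide by the number of documents in the cluster. Do this for all the terms that occur in the cluster. This gives you another n-dimensional vector, which acts as a new "document".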
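To make the two points above concrete, here is a small sketch in plain Python that stores each document as a sparse `{term: weight}` dict. The TF-IDF variant used (relative term frequency times `log(N / df)`) is just one common definition that I am assuming here, and the function names `tfidf_vectors`, `cosine_similarity`, and `centroid` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # assumed variant: relative frequency * log(N / df)
        vectors.append(
            {t: (count / len(doc)) * math.log(n / df[t]) for t, count in tf.items()}
        )
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(cluster):
    """Per term: sum of weights in the cluster / number of documents in it."""
    totals = Counter()
    for vec in cluster:
        for t, w in vec.items():
            totals[t] += w
    return {t: total / len(cluster) for t, total in totals.items()}
```

Note that no query is involved anywhere: the IDF part is computed from the document collection itself, so every document gets a weight vector that can be compared directly with any other document or with a centroid.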

Hope that helps!

Tags:

language-agnostic

algorithm

text-processing

information-retrieval

tf-idf

alskndalsnd
