
How to perform k-means clustering from Gensim TF-IDF values

I am using Gensim to build a vector space model. After creating a dictionary and corpus with Gensim, I calculated the TF-IDF (term frequency * inverse document frequency) weights using the following lines:

Term_IDF  = TfidfModel(corpus)
corpus_tfidf = Term_IDF[corpus]

corpus_tfidf contains a list of lists of term ids and their corresponding TF-IDF weights. I then separated the TF-IDF weights from the ids using the following lines:

 for doc in corpus_tfidf:
     for ids,tfidf in doc:    
         IDS.append(ids)
         tfidfmtx.append(tfidf)    
         IDS=[]

Now I want to use k-means clustering, so I want to compute the cosine similarities of the TF-IDF matrix. The problem is that Gensim does not produce a square matrix, so the following line raises an error. How can I get a square matrix from Gensim to calculate the similarities of all the documents in the vector space model? Also, how do I convert the TF-IDF matrix (which in this case is a list of lists) into a 2D NumPy array? Any comments are much appreciated.

dumydist = 1 - cosine_similarity(tfidfmtx)

Nhqazi asked Jun 19 '18 17:06


2 Answers

When you fit your corpus to a Gensim Dictionary, get the number of documents and terms in the dictionary (corpus_lists is assumed here to be a list of tokenized documents):

from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(corpus_lists)
num_docs = dictionary.num_docs
num_terms = len(dictionary.keys())

Transform into bag-of-words:

corpus_bow = [dictionary.doc2bow(doc) for doc in corpus_lists]

Transform into tf-idf:

from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus_bow)
corpus_tfidf = tfidf[corpus_bow]

Now you can transform into sparse/dense matrix:

from gensim.matutils import corpus2dense, corpus2csc
corpus_tfidf_dense = corpus2dense(corpus_tfidf, num_terms, num_docs)
corpus_tfidf_sparse = corpus2csc(corpus_tfidf, num_terms, num_docs)
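As far as I know, both helpers return a matrix laid out as terms x documents, which is why the transpose is needed later. To make that layout concrete, here is a dependency-free sketch that expands Gensim-style sparse documents into dense rows with one row per document. The function name `to_dense_rows` and the toy weights are made up for illustration; they are not part of Gensim.

```python
def to_dense_rows(corpus, num_terms):
    """Expand sparse (term_id, weight) documents into dense rows (documents x terms)."""
    rows = []
    for doc in corpus:
        row = [0.0] * num_terms          # one column per term id
        for term_id, weight in doc:
            row[term_id] = weight        # fill only the terms present in this document
        rows.append(row)
    return rows


# Hypothetical TF-IDF output for 3 documents over a 4-term vocabulary.
toy_tfidf = [
    [(0, 0.6), (1, 0.8)],
    [(1, 1.0)],
    [(2, 0.6), (3, 0.8)],
]

dense = to_dense_rows(toy_tfidf, num_terms=4)
for row in dense:
    print(row)
```

Each row has the same length (the vocabulary size), which is the rectangular 2D shape that the flattened tfidfmtx list in the question was missing.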

Now fit your model using either the sparse or the dense matrix (after transposing, so that each row is a document):

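from sklearn.cluster import KMeans
model = KMeans(n_clusters=7)
# the matrix is terms x documents; transpose so each row is a document
clusters = model.fit_predict(corpus_tfidf_dense.T)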
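On the cosine part of the question: once every document is a row of equal length, 1 - cosine similarity gives a square documents x documents matrix. The sketch below shows this in plain Python with made-up weights; `cosine_distance_matrix` is a hypothetical helper, not a Gensim or scikit-learn function. As I understand it, KMeans expects the document rows themselves rather than a distance matrix, so the square matrix is mainly useful for inspection or for methods that accept precomputed distances.

```python
import math


def cosine_distance_matrix(rows):
    """Return the square matrix of 1 - cosine similarity between all row pairs."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(rows[i], rows[j]))
            denom = norms[i] * norms[j]
            # treat an all-zero document as having similarity 0 to everything
            sim = dot / denom if denom else 0.0
            dist[i][j] = 1.0 - sim
    return dist


# Hypothetical dense TF-IDF rows: 3 documents, 4 terms.
rows = [
    [0.6, 0.8, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.8],
]

dist = cosine_distance_matrix(rows)
for line in dist:
    print([round(d, 3) for d in line])
```

Documents that share no terms end up at distance 1, and each document is at distance 0 from itself.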
DrGabrielA81 answered Nov 15 '22 19:11


To create a document-term matrix from Gensim, you can use matutils.corpus2csc.

Here corpus is a list of lists (a Gensim corpus):

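import gensim

# corpus2csc returns a SciPy CSC sparse matrix (terms x documents)
scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)

# convert to a dense NumPy array; use .T first if you need documents as rows
full_matrix = scipy_csc_matrix.toarray()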

You may want to keep the SciPy sparse format if your corpus is very large.
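One rough way to judge whether the dense conversion is affordable is to check how many cells would actually be non-zero. This is only a sketch with made-up data, and `density` is a hypothetical helper name:

```python
def density(corpus, num_terms):
    """Fraction of non-zero cells in the documents x terms matrix."""
    nonzero = sum(len(doc) for doc in corpus)   # each (term_id, weight) pair is one stored cell
    total = len(corpus) * num_terms
    return nonzero / total if total else 0.0


# Hypothetical TF-IDF corpus: 3 documents, 4 terms, 5 stored weights.
toy_tfidf = [
    [(0, 0.6), (1, 0.8)],
    [(1, 1.0)],
    [(2, 0.6), (3, 0.8)],
]

ratio = density(toy_tfidf, num_terms=4)
print(f"{ratio:.1%} of the cells are non-zero")
```

On real text the ratio is usually far smaller than in this toy case, since each document uses only a small part of the vocabulary, and that is where the sparse format pays off.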

Atendra answered Nov 15 '22 17:11