How do I calculate a word-word co-occurrence matrix with sklearn?

Question

I am looking for a module in sklearn that lets you derive the word-word co-occurrence matrix.

I can get the document-term matrix, but I am not sure how to obtain a word-word matrix of co-occurrences.

titipata · Accepted Answer

Here is an example solution using CountVectorizer in scikit-learn. As described in this post, you can use matrix multiplication to get the word-word co-occurrence matrix.

from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']
count_model = CountVectorizer(ngram_range=(1, 1))  # default unigram model
X = count_model.fit_transform(docs)
# X[X > 0] = 1  # run this line if you don't want extra within-text co-occurrence (see below)
Xc = X.T @ X  # this is the co-occurrence matrix, in sparse format
Xc.setdiag(0)  # optionally set each word's co-occurrence with itself to 0
print(Xc.todense())  # print the matrix in dense format
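To see why the matrix product yields co-occurrence counts, the sketch below recomputes the same numbers in plain Python. It assumes whitespace tokenization, which happens to agree with CountVectorizer's default tokenizer for these three documents but not in general. The names `doc_counts` and `cooc` are illustrative and not part of scikit-learn.

```python
from collections import Counter
from itertools import product

docs = ['this this this book',
        'this cat good',
        'cat good shit']

# Per-document term counts: each Counter plays the role of one row of X.
doc_counts = [Counter(doc.split()) for doc in docs]
vocab = sorted(set().union(*doc_counts))

# Entry (a, b) of X.T @ X is the sum over documents of count(a) * count(b).
cooc = {
    (a, b): sum(c[a] * c[b] for c in doc_counts)
    for a, b in product(vocab, repeat=2)
    if a != b  # skip the diagonal, mirroring Xc.setdiag(0)
}

print(cooc[('this', 'book')])  # 3: three 'this' times one 'book' in the first document
print(cooc[('cat', 'good')])   # 2: the pair appears together in two documents
```

The first value shows how repeated words inside a single document inflate the count, which is what the optional binarization step addresses.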

You can also look up the dictionary that maps each word to its row/column index in count_model:

count_model.vocabulary_
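As a minimal sketch of how that dictionary can be used, the snippet below looks up the co-occurrence count for a pair of words by name. The helper `cooccurrence` and the variable `terms` are hypothetical names introduced here for illustration. The product is converted with `.tocsr()` so that element indexing does not depend on which sparse format the multiplication returns.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']

count_model = CountVectorizer()
X = count_model.fit_transform(docs)

Xc = (X.T @ X).tocsr()  # word-word co-occurrence counts
Xc.setdiag(0)           # ignore each word's co-occurrence with itself

vocab = count_model.vocabulary_        # maps word -> column index
terms = sorted(vocab, key=vocab.get)   # words ordered by column index


def cooccurrence(word_a, word_b):
    """Return the co-occurrence count for two words in the vocabulary."""
    return int(Xc[vocab[word_a], vocab[word_b]])


print(terms)
print(cooccurrence('this', 'book'))
```

Sorting the vocabulary by index gives the row and column labels in matrix order, which is handy when printing the dense matrix.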

Or, if you want to normalize by the diagonal component (as in the answer in the previous post):

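import scipy.sparse as sp

Xc = X.T @ X  # recompute so the diagonal is not zeroed out
g = sp.diags(1. / Xc.diagonal())  # inverse of each word's diagonal entry
Xc_norm = g @ Xc  # normalized co-occurrence matrix (each row divided by its diagonal entry)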
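One way to read the normalized matrix, assuming the counts are first clipped to 0/1: the entry in row i, column j is the fraction of documents containing word i that also contain word j. The plain-Python sketch below checks that interpretation on the same three documents. The names `doc_sets` and `cond_fraction` are hypothetical, and whitespace tokenization is assumed.

```python
docs = ['this this this book',
        'this cat good',
        'cat good shit']

# Binarized view of the corpus: which words occur in each document.
doc_sets = [set(doc.split()) for doc in docs]


def cond_fraction(word_i, word_j):
    """Fraction of documents containing word_i that also contain word_j."""
    with_i = [s for s in doc_sets if word_i in s]
    both = [s for s in with_i if word_j in s]
    return len(both) / len(with_i)


print(cond_fraction('book', 'this'))  # 1.0: every document with 'book' has 'this'
print(cond_fraction('this', 'book'))  # 0.5: one of the two 'this' documents has 'book'
```

The result is not symmetric, because each row is scaled by its own diagonal entry. With raw counts instead of 0/1 values, entries can exceed 1 (row 'book', column 'this' becomes 3 / 1 = 3), and the diagonal must be non-zero, so normalize before any call to setdiag(0).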

As an extra note, following @Federico Caccia's answer: if you don't want spurious co-occurrences that come from a word repeating within a single text, set every count greater than 1 to 1, e.g.

X[X > 0] = 1  # do this first, before computing co-occurrence
Xc = (X.T * X)
...
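An alternative sketch that avoids modifying X in place: CountVectorizer accepts a `binary=True` parameter that sets all non-zero counts to 1 at vectorization time. This is a swapped-in technique rather than what the answer above does, but it should yield the same binarized document-term matrix. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book',
        'this cat good',
        'cat good shit']

binary_model = CountVectorizer(binary=True)  # every non-zero count becomes 1
Xb = binary_model.fit_transform(docs)

# Entry (i, j) is now the number of documents that contain both words.
Xc_binary = (Xb.T @ Xb).tocsr()
Xc_binary.setdiag(0)

vocab = binary_model.vocabulary_
print(Xc_binary[vocab['this'], vocab['book']])  # 1 rather than 3 with raw counts
```

With binarized counts, the repeated 'this' in the first document no longer inflates its co-occurrence with 'book'.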
