
Python - calculate the co-occurrence matrix

I'm working on an NLP task and need to calculate the co-occurrence matrix over documents. The basic formulation is as follows:

I have a matrix of shape (n, length), where each row represents a sentence of length words, so there are n sentences in total, all of the same length. Given a context size, e.g. window_size = 5, I want to calculate the co-occurrence matrix D, where the entry in the c-th row and w-th column is #(w,c): the number of times the context word c appears in w's context.
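To make the definition of #(w,c) concrete, here is a minimal sketch on a toy sentence. The function name count_pairs and the toy data are made up for illustration, and it assumes a symmetric window that excludes the word itself, which the question does not state explicitly.

```python
from collections import Counter

def count_pairs(sentences, window_size):
    """Count #(w, c): how often c appears within window_size positions of w."""
    counts = Counter()
    for sentence in sentences:
        for i, w in enumerate(sentence):
            lo = max(i - window_size, 0)
            hi = min(i + window_size + 1, len(sentence))
            for j in range(lo, hi):
                if j != i:  # assumption: a word is not its own context
                    counts[(w, sentence[j])] += 1
    return counts

# Toy input (hypothetical): one sentence of four words, window of 1 on each side
toy = [["a", "b", "a", "c"]]
pairs = count_pairs(toy, window_size=1)
print(pairs[("a", "b")])  # 2: "b" is next to the first "a" and to the second "a"
print(pairs[("a", "c")])  # 1
```

With a symmetric window the counts are symmetric too, so #(w,c) equals #(c,w).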

An example can be found here: How to calculate the co-occurrence between two words in a window of text?

I know it can be calculated with nested loops, but I want to know whether there exists a simpler way or a ready-made function. I have found some answers, but they do not work with a window sliding through the sentence. For example: word-word co-occurrence matrix

So could anyone tell me whether there is a function in Python that handles this problem concisely? I think this task is quite common in NLP.

GEORGE GUO asked Jan 15 '17 13:01




1 Answer

It is not that complicated, I think. Why not write a function yourself? First get the matrix X by following this tutorial: http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage (the code below assumes each row of X is a sentence of integer word indices). Then, for each sentence, count the co-occurrences and add them to a running total.

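import numpy as np
window = 5  # context size on each side of a word
vocab_size = int(np.max(X)) + 1  # assumes X holds word indices, shape (n, length)
m = np.zeros((vocab_size, vocab_size), dtype=int)
def cal_occ(sentence, m):
    for i, word in enumerate(sentence):
        # look at up to `window` words on each side of position i
        for j in range(max(i - window, 0), min(i + window + 1, len(sentence))):
            if j != i:  # skip the word itself
                m[word, sentence[j]] += 1
for sentence in X:
    cal_occ(sentence, m)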
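If the nested loops above are too slow, one possible alternative is to loop over window offsets only and let NumPy handle the positions. This is a sketch rather than a definitive implementation: the function name cooccurrence and its parameters are hypothetical, and it assumes X is an (n, length) array of integer word indices, a symmetric window, and that a word is not counted as its own context.

```python
import numpy as np

def cooccurrence(X, window, vocab_size):
    """Symmetric-window co-occurrence counts for an (n, length) array of word ids."""
    X = np.asarray(X)
    D = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    # Offsets larger than length - 1 would pair nothing, so cap the range.
    for offset in range(1, min(window, X.shape[1] - 1) + 1):
        left = X[:, :-offset].ravel()   # word at position i
        right = X[:, offset:].ravel()   # word at position i + offset
        # np.add.at accumulates correctly even when index pairs repeat
        np.add.at(D, (left, right), 1)  # right word appears in left word's context
        np.add.at(D, (right, left), 1)  # and the other way round
    return D

# Toy input (hypothetical): one sentence of four word ids, vocabulary of size 3
D = cooccurrence([[0, 1, 0, 2]], window=1, vocab_size=3)
print(D.tolist())  # [[0, 2, 1], [2, 0, 0], [1, 0, 0]]
```

Pairs never cross sentence boundaries here, because each offset slices within a row before flattening.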
Zealseeker answered Sep 23 '22 16:09