I'm working on an NLP task and I need to calculate the co-occurrence matrix over documents. The basic formulation is as follows:
I have a matrix with shape (n, length), where each row represents a sentence composed of length words, so there are n sentences in all, each of the same length. Then, with a defined context size, e.g., window_size = 5, I want to calculate the co-occurrence matrix D, where the entry in the cth row and wth column is #(w,c), which means the number of times that a context word c appears in w's context.
An example can be found here: How to calculate the co-occurrence between two words in a window of text?
I know it can be calculated with nested loops, but I want to know if there exists a simpler way or a ready-made function. I have found some answers, but they do not work with a window sliding through the sentence. For example: word-word co-occurrence matrix
So could anyone tell me whether there is any function in Python that can deal with this problem concisely? I think this task is quite common in NLP.
The normalized co-occurrence matrix is obtained by dividing each element of the co-occurrence matrix G by the total number of co-occurrence pairs in G. The adjacency can be defined to take place in each of the four directions (horizontal, vertical, left and right diagonal), as shown in Figure 1.
A co-occurrence matrix will have specific entities in rows (ER) and columns (EC). The purpose of this matrix is to present the number of times each ER appears in the same context as each EC. As a consequence, in order to use a co-occurrence matrix, you have to define your entities and the context in which they co-occur.
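As a rough illustration of defining entities and context, the sketch below assumes the entities are words and the context is a whole sentence, so two words co-occur whenever they appear in the same sentence. The function name sentence_cooccurrence and the toy sentences are made up for this example.

```python
from collections import Counter
from itertools import combinations

def sentence_cooccurrence(sentences):
    """Count, for every ordered word pair, the sentences containing both words."""
    counts = Counter()
    for tokens in sentences:
        # Each distinct word is counted once per sentence.
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep the counts symmetric
    return counts

# Toy data, invented for illustration only.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = sentence_cooccurrence(sentences)
print(counts[("the", "sat")])
```

A different choice of context (for example a sliding window rather than a whole sentence) changes the counts, which is why the context has to be fixed before the matrix is built.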
You need to change the initialization of the gray-level co-occurrence matrix to glcm = np.zeros((256, 256), dtype=int); otherwise, if the image to process contains some pixels with the intensity level 255, the function getGLCM will throw an error.
A co-occurrence matrix or co-occurrence distribution (also referred to as a gray-level co-occurrence matrix, GLCM) is a matrix that is defined over an image to be the distribution of co-occurring pixel values (grayscale values, or colors) at a given offset.
This article attempts to provide a brief introduction to the co-occurrence matrix and its implementation in Python. Given a document with a set of sentences in it, the co-occurrence matrix is a matrix representation of this document.
To get the population covariance matrix (based on N), you'll need to set bias to True when computing the covariance matrix with the numpy package.
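A small sketch of that setting, using np.cov with rows as variables. The data values are invented for illustration and are not from the original article.

```python
import numpy as np

# Two variables (rows) observed four times each; values are illustrative only.
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0]])

# bias=True divides by N (population); the default divides by N - 1 (sample).
population_cov = np.cov(data, bias=True)
sample_cov = np.cov(data)
print(population_cov)
```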
A correlation matrix has been created using two libraries. The NumPy library provides the corrcoef() function, which for two variables x and y returns a 2×2 matrix. The matrix consists of the correlations of x with x (0,0), x with y (0,1), y with x (1,0) and y with y (1,1).
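For instance, a short sketch with made-up values for x and y (not taken from the original article) shows the 2×2 layout described above:

```python
import numpy as np

# Illustrative values only: y decreases exactly as x increases.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([8.0, 6.0, 4.0, 2.0])

corr = np.corrcoef(x, y)
# corr[0, 0]: x with x, corr[0, 1]: x with y,
# corr[1, 0]: y with x, corr[1, 1]: y with y
print(corr)
```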
The goal is to implement a co-occurrence matrix that counts the number of times word1 occurs in the context of word2 within a neighbourhood of a given size, let's say 5. There are 100 words and a list of 1000 sentences.
It is not that complicated, I think. Why not write a function yourself? First build the vocabulary and encode your sentences as X, following this tutorial: http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage Then, for each sentence, calculate the co-occurrence counts and add them to a running total.
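import numpy as np

window, vocab_size = 5, 100  # context size per side; number of distinct word ids
m = np.zeros((vocab_size, vocab_size), dtype=int)  # rows and columns index word ids

def cal_occ(sentence, m):
    for i, word in enumerate(sentence):
        # neighbours within `window` positions on either side of position i
        for j in range(max(i - window, 0), min(i + window + 1, len(sentence))):
            if j != i:  # do not count the word as its own context
                m[word, sentence[j]] += 1

for sentence in X:  # X: sentences given as sequences of integer word ids
    cal_occ(sentence, m)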
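If the sentences are already stored as an (n, length) integer array, as in the question, the inner loops can be replaced by array shifts. The following is only a sketch under that assumption: the function name cooccurrence_matrix is made up, word ids are assumed to lie in 0..vocab_size-1, and window_size is taken to mean the number of positions on each side of the target word.

```python
import numpy as np

def cooccurrence_matrix(X, vocab_size, window_size):
    """Symmetric window co-occurrence counts for an (n, length) array of word ids."""
    X = np.asarray(X)
    D = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    for d in range(1, window_size + 1):
        # Pair every word with the word d positions to its right, in all sentences.
        left = X[:, :-d].ravel()
        right = X[:, d:].ravel()
        # np.add.at accumulates correctly when the same pair occurs many times.
        np.add.at(D, (left, right), 1)
        np.add.at(D, (right, left), 1)
    return D

# Toy data, invented for illustration: 2 sentences of length 4, 4 distinct words.
X = np.array([[0, 1, 2, 1],
              [2, 0, 1, 3]])
D = cooccurrence_matrix(X, vocab_size=4, window_size=1)
print(D)
```

This loops only over the window offsets rather than over every word position, which tends to help when n is large and the window is small. Because the window is symmetric, D equals its transpose, so the row/column convention for w and c does not matter here.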