The PyPI docs for a Google Ngram downloader say that "sometimes you need an aggregate data over the dataset. For example to build a co-occurrence matrix." The Wikipedia article on co-occurrence matrices is about image processing, and googling the term mostly brings up what looks like an SEO trick. So what are co-occurrence matrices in computational linguistics/NLP, and how are they used there?



What are co-occurrence matrices and how are they used in NLP?

2 Answers

What is a co-occurrence matrix?

Generally speaking, a co-occurrence matrix has specific entities in its rows (ER) and columns (EC). The purpose of the matrix is to present the number of times each ER appears in the same context as each EC. As a consequence, in order to use a co-occurrence matrix, you have to define your entities and the context in which they co-occur.

In NLP, the most classic approach is to define each entity (i.e., each row and column) as a word present in a text, and the context as a sentence.

Consider the following text:

Roses are red. Sky is blue.

With the classic approach described above, we get the following matrix:

      | Roses | are | red | Sky | is | blue
Roses |   1   |  1  |  1  |  0  | 0  |  0
are   |   1   |  1  |  1  |  0  | 0  |  0
red   |   1   |  1  |  1  |  0  | 0  |  0
Sky   |   0   |  0  |  0  |  1  | 1  |  1
is    |   0   |  0  |  0  |  1  | 1  |  1
blue  |   0   |  0  |  0  |  1  | 1  |  1

Here, each cell indicates whether the two items co-occur or not. You may replace this flag with the number of times they co-occur, or with a more sophisticated measure. You may also change the entities themselves, for example by putting nouns in columns and adjectives in rows instead of using every word.
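As a minimal sketch of the sentence-level approach above, the following Python builds the same 0/1 matrix for the example text. The splitting on "." and on whitespace is a simplifying assumption made for this illustration; a real pipeline would use a proper tokenizer.

```python
from itertools import product

text = "Roses are red. Sky is blue."

# Naive sentence and word splitting (assumption: "." ends every sentence).
sentences = [s.split() for s in text.split(".") if s.strip()]

# Vocabulary in order of first appearance, so rows match the table above.
vocab = []
for sentence in sentences:
    for word in sentence:
        if word not in vocab:
            vocab.append(word)
index = {word: i for i, word in enumerate(vocab)}

# matrix[i][j] is 1 if vocab[i] and vocab[j] share at least one sentence.
matrix = [[0] * len(vocab) for _ in vocab]
for sentence in sentences:
    for a, b in product(set(sentence), repeat=2):
        matrix[index[a]][index[b]] = 1

for word, row in zip(vocab, matrix):
    print(f"{word:<6}", row)
```

Changing the assignment `= 1` to `+= 1` would store counts instead of flags, which is the first variation mentioned above.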

What are they used for in NLP?

The most evident use of these matrices is their ability to provide links between notions. Let's suppose you're working on product reviews. Let's also suppose, for simplicity, that each review is composed only of short sentences. You'll have something like this:

ProductX is amazing.

I hate productY.

Representing these reviews as one co-occurrence matrix will enable you to associate products with appreciations.
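A rough sketch of that idea, with products as rows and appreciation words as columns, could look like the following. The `products` and `opinions` sets are hypothetical and hand-picked for this illustration; in practice they would come from a product catalogue and a sentiment lexicon or a part-of-speech tagger.

```python
from collections import Counter

reviews = ["ProductX is amazing.", "I hate productY."]

# Hypothetical entity lists, chosen only for this example.
products = {"productx", "producty"}
opinions = {"amazing", "hate"}

# counts[(product, opinion)] = number of sentences containing both.
counts = Counter()
for review in reviews:
    for sentence in review.split("."):
        words = {w.lower() for w in sentence.split()}
        for p in products & words:
            for o in opinions & words:
                counts[(p, o)] += 1

for (p, o), n in sorted(counts.items()):
    print(p, o, n)
```

Only the (product, opinion) pairs that actually share a sentence get a non-zero count, which is what links each product to the way reviewers talk about it.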

answered Nov 07 '22 22:11

merours

The co-occurrence matrix indicates how many times the row word (e.g. 'digital') is surrounded by the column word (e.g. 'pie'), either in the same sentence or within a ±4-word window, depending on the application.

The entry '5' in the following table, for example, means that we had 5 sentences in our text where 'digital' was surrounded by 'pie'.

[Image: co-occurrence table, https://i.stack.imgur.com/piZ39.png]

These sentences could have been:

I love a digital pie.
What's digital is often a pie.
May I have some digital pie?
Digital world necessitates pie-eating.
There's something digital about this pie.

Note that with a symmetric context like this one (same sentence, or a ± window), the co-occurrence matrix is symmetric: the entry with the row word 'pie' and the column word 'digital' will be 5 as well (as these words co-occur in the very same sentences!).
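To make the counting concrete, here is a small sketch that applies a ±4-word window to the five example sentences. The regex tokenizer and the choice to keep windows inside sentence boundaries are assumptions made for this illustration, not necessarily how the table in the image was produced.

```python
import re
from collections import defaultdict

sentences = [
    "I love a digital pie.",
    "What's digital is often a pie.",
    "May I have some digital pie?",
    "Digital world necessitates pie-eating.",
    "There's something digital about this pie.",
]

WINDOW = 4  # count neighbours up to 4 tokens to the left or right


def tokenize(sentence):
    # Lowercase and keep runs of letters/apostrophes,
    # so "pie-eating" becomes "pie", "eating".
    return re.findall(r"[a-z']+", sentence.lower())


# cooc[(row_word, column_word)] = number of times column_word
# appears within WINDOW tokens of row_word.
cooc = defaultdict(int)
for sentence in sentences:
    tokens = tokenize(sentence)
    for i, word in enumerate(tokens):
        lo = max(0, i - WINDOW)
        hi = min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[(word, tokens[j])] += 1

print(cooc.get(("digital", "pie"), 0))  # 5
print(cooc.get(("pie", "digital"), 0))  # 5

# With a symmetric window, every entry equals its mirror entry.
is_symmetric = all(cooc.get((b, a), 0) == n for (a, b), n in cooc.items())
print(is_symmetric)  # True
```

A one-sided window (for example, only the words to the left of the row word) would break this symmetry, so the property depends on how the context is defined.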

answered Nov 07 '22 20:11

lakesare


Tags:

nlp

bernie2436
