How to calculate word co-occurrence

Tags: string, matlab

I have a string of characters, say of length 50, representing a sequence such as abbcda..., with letters taken from the alphabet A = {a, b, c, d}.

I want to count how many times b is followed by another b, i.e. occurrences of the n-gram bb with n = 2.

Similarly, I want to count how many times a particular character is repeated three times consecutively (n = 3). For example, in the input string abbbcbbb, the number of times b occurs in a run of 3 letters is 2.

Srishti M, asked Jul 28 '13 21:07



1 Answer

To count non-overlapping occurrences of the 2-gram bb, you can use

numel(regexp(str, 'b{2}'))

and for the 3-gram bbb

numel(regexp(str, 'b{3}'))

To count overlapping 2-grams, use a positive lookahead, which checks the following characters without consuming them:

numel(regexp(str, '(b)(?=b{1})'))

and for overlapping n-grams of b, with the length n held in a variable n,

numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']))
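As a quick check of the patterns above, here is a minimal sketch run on the question's example string abbbcbbb. The variable names (nonOverlap2, overlap2, and so on) are illustrative and not part of the original answer.

```matlab
str = 'abbbcbbb';   % example string from the question
n   = 3;            % n-gram length for the general pattern

% Non-overlapping: each match consumes its characters, so the
% search resumes after the end of the previous match.
nonOverlap2 = numel(regexp(str, 'b{2}'));   % matches start at 2 and 6 -> 2
nonOverlap3 = numel(regexp(str, 'b{3}'));   % matches start at 2 and 6 -> 2

% Overlapping: only the first b is consumed; the lookahead merely
% checks the following characters, so matches may share letters.
overlap2 = numel(regexp(str, '(b)(?=b{1})'));                    % starts 2,3,6,7 -> 4
overlapN = numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']));   % starts 2,6 -> 2
```

The two approaches only differ when runs are longer than n: for the string bbbb, the pattern b{3} gives 1 non-overlapping match, while the lookahead version with n = 3 gives 2.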

EDIT: To count the occurrences of an arbitrary sequence, put its first character inside the first pair of parentheses and the remaining characters after the ?= of the lookahead. For example, to count ba use

numel(regexp(str, '(b)(?=a)'))

and to count bda use

numel(regexp(str, '(b)(?=da)'))
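
The same construction can be wrapped in a small helper. The sketch below is an illustration under stated assumptions: countSeq is a hypothetical name, the test string is made up, and the sequence is assumed to have at least two characters and to contain only plain letters (no regular-expression metacharacters). As a cross-check it also uses strfind, a different built-in that returns the start index of every occurrence, including overlapping ones.

```matlab
% Hypothetical helper: first character in the group, the rest in the lookahead.
% Assumes seq has length >= 2 and contains letters only.
countSeq = @(str, seq) numel(regexp(str, ['(' seq(1) ')(?=' seq(2:end) ')']));

str = 'abdabbabda';              % made-up test string

nBA  = countSeq(str, 'ba');      % ba starts at index 6        -> 1
nBDA = countSeq(str, 'bda');     % bda starts at indices 2, 8  -> 2

% Alternative without regular expressions: strfind returns all start indices.
nBDAalt = numel(strfind(str, 'bda'));   % -> 2
```

For fixed literal sequences such as these, strfind avoids building a pattern string; the regexp form is the one that extends to character classes or repetition counts.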
Mohsen Nosratinia, answered Oct 03 '22 19:10