I have a string of characters, of length 50 say, representing a sequence abbcda... whose letters are taken from the alphabet A = {a, b, c, d}.
I want to count how many times b is followed by another b (an n-gram with n = 2).
Similarly, I want to count how many times a particular character is repeated three times consecutively (n = 3). For example, in the input string abbbcbbb, the number of times b occurs in a run of 3 letters is 2.
To count the non-overlapping occurrences of the 2-gram bb, you can use
numel(regexp(str, 'b{2}'))
and for the 3-gram bbb
numel(regexp(str, 'b{3}'))
To count overlapping 2-grams, use a positive lookahead, which checks that the next character is b without consuming it:
numel(regexp(str, '(b)(?=b{1})'))
and for overlapping n-grams in general
numel(regexp(str, ['(b)(?=b{' num2str(n-1) '})']))
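As a quick check of the difference between the two counts, here is a minimal sketch that wraps the expressions above in two anonymous functions and runs them on the string from the question. The names countNonOverlap and countOverlap are made up for illustration, and the sketch assumes the character c is a plain letter (not a regex metacharacter) and that n is at least 2.

```matlab
% Hypothetical helpers (names are illustrative, not from the answer).
% Assumes c is a plain letter such as 'a'..'d' and n >= 2.

% Non-overlapping runs of n consecutive copies of c
countNonOverlap = @(s, c, n) numel(regexp(s, sprintf('%s{%d}', c, n)));

% Overlapping: every c that is followed by n-1 more copies of c
countOverlap = @(s, c, n) numel(regexp(s, sprintf('(%s)(?=%s{%d})', c, c, n - 1)));

str = 'abbbcbbb';   % example string from the question

k2 = countNonOverlap(str, 'b', 2);   % 2 (bb at positions 2 and 6)
m2 = countOverlap(str, 'b', 2);      % 4 (bb at positions 2, 3, 6 and 7)
k3 = countNonOverlap(str, 'b', 3);   % 2 (bbb at positions 2 and 6)
m3 = countOverlap(str, 'b', 3);      % 2 (bbb at positions 2 and 6)
```

For n = 3 both counts agree on this string because neither run of b is longer than three characters; they differ as soon as a run is longer than n.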
EDIT
To count the occurrences of an arbitrary sequence, put its first character in the first pair of parentheses and the rest of the sequence inside the lookahead, after the equals sign. To count ba, use
numel(regexp(str, '(b)(?=a)'))
and to count bda, use
numel(regexp(str, '(b)(?=da)'))
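The same pattern can be assembled from a variable rather than typed by hand. The sketch below is one possible way to do that, assuming the sequence pat has at least two characters and contains only plain letters; the variable names and the sample string are illustrative. It also shows strfind as an alternative that is not part of the answer above: strfind returns the start index of every occurrence, overlapping ones included, so counting its output gives the same number.

```matlab
str = 'abdabdab';   % illustrative input, not from the question
pat = 'bda';        % sequence to count; assumed length >= 2, plain letters only

% Lookahead form: first character in the group, the rest inside (?=...)
expr = ['(' pat(1) ')(?=' pat(2:end) ')'];
countRegexp = numel(regexp(str, expr));      % 2 (bda at positions 2 and 5)

% Alternative: strfind reports every start index, including overlaps
countStrfind = numel(strfind(str, pat));     % 2

% Overlapping occurrences are counted as well
countBB = numel(strfind('abbbcbbb', 'bb'));  % 4 (positions 2, 3, 6 and 7)
```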