Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Estimating confidence intervals of a Markov transition matrix

I have a series of n=400 sequences of varying length containing the letters ACGTE. For example, the probability of having C after A is:

enter image description here

and which can be calculated from the set of empirical sequences, thus

enter image description here

Assuming: enter image description here

Then I get a transition matrix:

enter image description here

But I'm interested in calculating the confidence intervals for Phat, any thoughts on how I could I go about it?

like image 298
HCAI Avatar asked Jul 15 '13 21:07

HCAI


2 Answers

You could use bootstrapping to estimate confidence intervals. MATLAB provides bootci function in the Statistics toolbox. Here is an example:

%# generate a random cell array of 400 sequences of varying length
%# each containing indices from 1 to 5 corresponding to ACGTE
sequences = arrayfun(@(~) randi([1 5], [1 randi([500 1000])]), 1:400, ...
    'UniformOutput',false)';

%# compute transition matrix from all sequences
trans = countFcn(sequences);

%# number of bootstrap samples to draw
Nboot = 1000;

%# estimate 95% confidence interval using bootstrapping
ci = bootci(Nboot, {@countFcn, sequences}, 'alpha',0.05);
ci = permute(ci, [2 3 1]);

We get:

>> trans         %# 5x5 transition matrix: P_hat
trans =
      0.19747       0.2019      0.19849       0.2049      0.19724
      0.20068      0.19959      0.19811      0.20233      0.19928
      0.19841      0.19798       0.2021       0.2012      0.20031
      0.20077      0.19926      0.20084      0.19988      0.19926
      0.19895      0.19915      0.19963      0.20139      0.20088

and two other similar matrices containing the lower and upper bounds of confidence intervals:

>> ci(:,:,1)     %# CI lower bound
>> ci(:,:,2)     %# CI upper bound

I am using the following function to compute the transition matrix from a set of sequences:

function trans = countFcn(seqs)
    %# accumulate transition matrix from all sequences
    trans = zeros(5,5);
    for i=1:numel(seqs)
        trans = trans + sparse(seqs{i}(1:end-1), seqs{i}(2:end), 1, 5,5);
    end

    %# normalize into proper probabilities
    trans = bsxfun(@rdivide, trans, sum(trans,2));
end

As a bonus, we can use bootstrp function to get the statistic computed from each bootstrap sample, which we use to show a histogram for each of the entries in the transition matrix:

%# compute multiple transition matrices using bootstrapping
stat = bootstrp(Nboot, @countFcn, sequences);

%# display histogram for each entry in the transition matrix
sub = reshape(1:5*5,5,5);
figure
for i=1:size(stat,2)
    subplot(5,5,sub(i))
    hist(stat(:,i))
end

bootstrap_histograms

like image 50
Amro Avatar answered Sep 21 '22 18:09

Amro


Not sure whether it is statistically sound, but an easy way to get an indicative upper and lower bound:

Cut your sample in n equal pieces (for example 1:40,41:80,...,361:400) and calculate the probability matrix for each of these subsamples.

By looking at the distribution of probabilities amongst subsamples you should get a pretty good idea of what the variance is.

The disadvantage of this method is that it may not be possible actually calculate an interval with a desired given probability. The advantage is that it should give you good feeling for how the series behaves, and that it may capture some information that could be lost in other methods due to the assumptions that other methods (for example bootstrapping) are based on.

like image 39
Dennis Jaheruddin Avatar answered Sep 22 '22 18:09

Dennis Jaheruddin