
How does bootstrapping improve the quality of a phylogenetic reconstruction?

My understanding of bootstrapping is that you

  1. Build a "tree" using some algorithm from a matrix of sequences (nucleotides, let's say).
  2. Store that tree.
  3. Perturb the matrix from step 1 and rebuild the tree.

My question is: what is the purpose of step 3 from a sequence bioinformatics perspective? My guess is that changing characters in the original matrix removes artifacts in the data. But I have a problem with that guess: I am not sure why removing such artifacts is necessary. A sequence alignment is supposed to deal with artifacts by finding long stretches of similarity, by its very nature.

asked Oct 12 '11 02:10 by jayunit100



2 Answers

Bootstrapping, in phylogenetics as elsewhere, doesn't improve the quality of whatever you're trying to estimate (a tree in this case). What it does do is give you an idea of how confident you can be about the result you get from your original dataset. A bootstrap analysis answers the question "If I repeated this experiment many times, using a different sample each time (but of the same size), how often would I expect to get the same result?" This is usually broken down by edge ("How often would I expect to see this particular edge in the inferred tree?").

Sampling Error

More precisely, bootstrapping is a way of approximately measuring the expected level of sampling error in your estimate. Most evolutionary models have the property that, if your dataset had an infinite number of sites, you would be guaranteed to recover the correct tree and correct branch lengths*. But with a finite number of sites this guarantee disappears. What you infer in these circumstances can be considered to be the correct tree plus sampling error, where the sampling error tends to decrease as you increase the sample size (number of sites). What we want to know is how much sampling error we should expect for each edge, given that we have (say) 1000 sites.

What We Would Like To Do, But Can't

Suppose you used an alignment of 1000 sites to infer the original tree. If you somehow had the ability to sequence as many sites as you wanted for all your taxa, you could extract another 1000 sites from each and perform this tree inference again, in which case you would probably get a tree that was similar to, but slightly different from, the original tree. You could do this again and again, using a fresh batch of 1000 sites each time; if you did this many times, you would produce a distribution of trees as a result. This is called the sampling distribution of the estimate. In general it will have highest density near the true tree. It also becomes more concentrated around the true tree if you increase the sample size (number of sites).
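To make the "more concentrated around the true tree" claim concrete, here is a deliberately crude toy, not a real phylogenetic analysis. It assumes there are only three candidate topologies, that each site independently "votes" for one of them with made-up probabilities (`SITE_WEIGHTS` is purely hypothetical), and that the inferred tree is simply the topology with the most votes. Repeating the experiment many times gives a rough picture of the sampling distribution at two different site counts.

```python
import random
from collections import Counter

TRUE_TOPOLOGY = 0
# Hypothetical per-site probabilities of supporting topologies 0, 1 and 2.
# These numbers are an assumption chosen for illustration only.
SITE_WEIGHTS = [0.4, 0.3, 0.3]

def infer_topology(n_sites, rng):
    """Draw n_sites fresh sites and return the topology with the most votes.

    Ties go to the lowest-numbered topology.
    """
    votes = Counter(rng.choices([0, 1, 2], weights=SITE_WEIGHTS, k=n_sites))
    return max([0, 1, 2], key=lambda topology: votes[topology])

def fraction_correct(n_sites, n_experiments, rng):
    """Fraction of repeated experiments that recover the true topology."""
    hits = sum(
        infer_topology(n_sites, rng) == TRUE_TOPOLOGY
        for _ in range(n_experiments)
    )
    return hits / n_experiments

rng = random.Random(0)
small = fraction_correct(20, 200, rng)     # few sites: a noisy estimate
large = fraction_correct(1000, 200, rng)   # many sites: far less sampling error
print(f"20 sites:   {small:.2f} of experiments recover the true topology")
print(f"1000 sites: {large:.2f} of experiments recover the true topology")
```

With real data you cannot run this experiment, because it needs an unlimited supply of fresh sites drawn from the true process.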

What does this distribution tell us? It tells us how likely it is that any given sample of 1000 sites generated by this evolutionary process (tree + branch lengths + other parameters) will actually give us the true tree -- in other words, how confident we can be about our original analysis. As I mentioned above, this probability-of-getting-the-right-answer can be broken down by edge -- that's what "bootstrap probabilities" are.
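The per-edge breakdown is just counting. As a minimal sketch, assume each tree is reduced to the set of clades (groups of taxa) it contains; real software usually works with bipartitions of an unrooted tree, so treating edges as rooted clades is a simplification here. The function name `clade_support` and the example taxa are hypothetical.

```python
def clade_support(original_clades, replicate_trees):
    """Fraction of replicate trees that contain each clade of the original tree."""
    n_replicates = len(replicate_trees)
    return {
        clade: sum(clade in tree for tree in replicate_trees) / n_replicates
        for clade in original_clades
    }

def clade(*taxa):
    """A clade is represented as an immutable set of taxon names."""
    return frozenset(taxa)

# Clades of the tree inferred from the original data (assumed example).
original = {clade("A", "B"), clade("A", "B", "C")}

# Clades of four replicate trees (assumed example).
replicates = [
    {clade("A", "B"), clade("A", "B", "C")},
    {clade("A", "B"), clade("A", "B", "D")},
    {clade("A", "C"), clade("A", "C", "D")},
    {clade("A", "B"), clade("A", "B", "C")},
]

support = clade_support(original, replicates)
for group, value in sorted(support.items(), key=lambda item: sorted(item[0])):
    print(sorted(group), value)
```

In this made-up example the clade {A, B} appears in three of the four replicates and {A, B, C} in two of them, so their support values are 0.75 and 0.5.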

What We Can Do Instead

We don't actually have the ability to magically generate as many alignment columns as we want, but we can "pretend" that we do, by simply regarding the original set of 1000 sites as a pool of sites from which we draw a fresh batch of 1000 sites with replacement for each replicate (so a given column may be drawn several times, or not at all). This generally produces a distribution of results that is different from the true 1000-site sampling distribution, but for large site counts the approximation is good.
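The resampling step itself is short. Here is a minimal sketch, assuming the alignment is held as a dict mapping taxon names to equal-length strings; the function name `bootstrap_alignment` and the tiny 10-site alignment are made up for illustration. Note that whole columns are resampled, so every taxon gets the same set of sites in each replicate.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one pseudo-replicate alignment.

    The replicate has the same taxa and the same number of sites as the
    input; its columns are drawn from the input columns with replacement.
    """
    n_sites = len(next(iter(alignment.values())))
    # Pick n_sites column indices with replacement; duplicates are expected.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        taxon: "".join(sequence[i] for i in columns)
        for taxon, sequence in alignment.items()
    }

# Toy alignment (assumed data, 4 taxa x 10 sites).
alignment = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACGTACGAAC",
    "taxonC": "ACGAACGTTC",
    "taxonD": "TCGAACGTTC",
}

rng = random.Random(42)
replicate = bootstrap_alignment(alignment, rng)
for taxon, sequence in replicate.items():
    print(taxon, sequence)
```

Each replicate would then be fed to the same tree-building method as the original alignment, and the resulting trees compared edge by edge with the original tree.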


* That is assuming that the dataset was in fact generated according to this model -- which is something that we cannot know for certain, unless we're doing a simulation. Also some models, like uncorrected parsimony, actually have the paradoxical quality that under some conditions, the more sites you have, the lower the probability of recovering the correct tree!

answered Sep 21 '22 12:09 by j_random_hacker


Bootstrapping is a general statistical technique that has applications outside of bioinformatics. It is a flexible means of coping with small samples, or with samples from a complex population (which I imagine is the case in your application).

answered Sep 19 '22 12:09 by phs