My understanding of bootstrapping is that you (1) take the original alignment, (2) build a tree from it, and (3) repeatedly resample the characters (columns) of the original matrix with replacement to create new matrices, building a tree from each.
My question is: what is the purpose of step 3 from a sequence-bioinformatics perspective? My guess is that by changing characters in the original matrix you can remove artifacts in the data, but I have a problem with that guess: I am not sure why removing such artifacts is necessary. A sequence alignment is supposed to deal with artifacts by finding long stretches of similarity, by its very nature.
The data generated by bootstrapping is used to estimate the confidence of the branches in a phylogenetic tree.
Bootstrapping is any test or metric that relies on random sampling with replacement; it falls under the broader class of resampling methods. The resampling is used to estimate the sampling distribution of the desired estimator. In phylogenetics, this approach is used to assess the reliability of a sequence-based phylogeny.
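To make the general statistical idea concrete, here is a minimal sketch in Python (the data and the `bootstrap_se` function are purely illustrative): it estimates the standard error of the sample mean by recomputing the mean on many resamples drawn with replacement.

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy observed sample; in practice this would be your real data.
sample = rng.normal(loc=5.0, scale=2.0, size=100)

def bootstrap_se(data, statistic, n_replicates=1000):
    """Estimate the standard error of `statistic` by recomputing it
    on `n_replicates` resamples of `data` drawn with replacement."""
    n = len(data)
    estimates = [statistic(rng.choice(data, size=n, replace=True))
                 for _ in range(n_replicates)]
    return np.std(estimates, ddof=1)

print("sample mean:", np.mean(sample))
print("bootstrap SE of the mean:", bootstrap_se(sample, np.mean))
```

The same logic carries over to trees: the "statistic" is the inferred tree (or the presence of a particular clade), and the resampled units are alignment columns.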
In my own experience, rethinking your outgroup species, improving the alignment (including coding indels), and of course removing uninformative and low-quality sites from the alignment will definitely help increase the bootstrap support in your tree.
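As a rough sketch of one such cleanup step, here is a toy Python filter that drops alignment columns that are gap-heavy or invariant (the threshold, the helper name, and the example alignment are all made up, and whether you actually want to drop invariant sites depends on your inference method):

```python
def clean_alignment(seqs, max_gap_frac=0.5):
    """Drop columns that are gap-heavy or invariant.
    `seqs` is a list of equal-length aligned sequences."""
    n = len(seqs)
    kept = []
    for col in zip(*seqs):
        gap_frac = col.count('-') / n
        residues = set(col) - {'-'}
        # Keep the column only if it is mostly ungapped and variable.
        if gap_frac <= max_gap_frac and len(residues) > 1:
            kept.append(col)
    return [''.join(chars) for chars in zip(*kept)]

aln = ["ACG-TA",
       "ACGTTA",
       "ACC-TA"]
print(clean_alignment(aln))  # -> ['G', 'G', 'C']: only column 3 survives
```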
The bootstrap value for a clade is the proportion of replicate phylogenies that recovered that clade from the original phylogeny, i.e. the tree built from the original alignment.
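Here is a minimal Python sketch of that counting step. To stay library-free, each tree is reduced to the set of clades it contains, with a clade represented as a frozenset of taxon names; the trees themselves are invented for illustration (a real analysis would extract clades from Newick trees with a tree library):

```python
from collections import Counter

original_tree = {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}

replicate_trees = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "C"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "D"})},
]

# Count how often each clade shows up across the replicates.
clade_counts = Counter(c for tree in replicate_trees for c in tree)

for clade in original_tree:
    support = clade_counts[clade] / len(replicate_trees)
    print(sorted(clade), "bootstrap value:", round(support, 2))
```

Both clades of the original tree appear in 2 of the 3 replicates, so each gets a support of about 0.67.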
Bootstrapping, in phylogenetics as elsewhere, doesn't improve the quality of whatever you're trying to estimate (a tree in this case). What it does do is give you an idea of how confident you can be about the result you get from your original dataset. A bootstrap analysis answers the question "If I repeated this experiment many times, using a different sample each time (but of the same size), how often would I expect to get the same result?" This is usually broken down by edge ("How often would I expect to see this particular edge in the inferred tree?").
More precisely, bootstrapping is a way of approximately measuring the expected level of sampling error in your estimate. Most evolutionary models have the property that, if your dataset had an infinite number of sites, you would be guaranteed to recover the correct tree and correct branch lengths*. But with a finite number of sites this guarantee disappears. What you infer in these circumstances can be considered to be the correct tree plus sampling error, where the sampling error tends to decrease as you increase the sample size (number of sites). What we want to know is how much sampling error we should expect for each edge, given that we have (say) 1000 sites.
Suppose you used an alignment of 1000 sites to infer the original tree. If you somehow had the ability to sequence as many sites as you wanted for all your taxa, you could extract another 1000 sites from each and perform this tree inference again, in which case you would probably get a tree that was similar to, but slightly different from, the original tree. You could do this again and again, using a fresh batch of 1000 sites each time; if you did this many times, you would produce a distribution of trees as a result. This is called the sampling distribution of the estimate. In general it has its highest density near the true tree, and it becomes more concentrated around the true tree as you increase the sample size (number of sites).
What does this distribution tell us? It tells us how likely it is that any given sample of 1000 sites generated by this evolutionary process (tree + branch lengths + other parameters) will actually give us the true tree -- in other words, how confident we can be about our original analysis. As I mentioned above, this probability-of-getting-the-right-answer can be broken down by edge -- that's what "bootstrap probabilities" are.
We don't actually have the ability to magically generate as many alignment columns as we want, but we can "pretend" that we do by treating the original set of 1000 sites as a pool from which we draw a fresh batch of 1000 sites with replacement for each replicate. This generally produces a distribution of results that differs from the true 1000-site sampling distribution, but for large numbers of sites the approximation is good.
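That "pretend" step is simple to implement. Here is a minimal Python sketch of building one bootstrap replicate alignment (the toy alignment and the `bootstrap_alignment` helper are just illustrative):

```python
import random

random.seed(0)

def bootstrap_alignment(seqs):
    """Build one bootstrap replicate alignment: draw as many columns
    as the original has, with replacement, then reassemble the rows."""
    n_sites = len(seqs[0])
    picked = [random.randrange(n_sites) for _ in range(n_sites)]
    return [''.join(seq[i] for i in picked) for seq in seqs]

aln = ["ACGTA",
       "ACGTT",
       "TCGAA"]
print(bootstrap_alignment(aln))
# Same dimensions as the input, but some columns appear more than
# once and others not at all -- that is the resampling step.
```

You would repeat this (say) 100 or 1000 times, infer a tree from each replicate, and then count clades as described above.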
* That is assuming that the dataset was in fact generated according to this model -- which is something we cannot know for certain, unless we're doing a simulation. Also, some inference methods, like uncorrected parsimony, have the paradoxical property that under some conditions, the more sites you have, the lower the probability of recovering the correct tree!
Bootstrapping is a general statistical technique that has applications outside of bioinformatics. It is a flexible means of coping with small samples, or with samples from a complex population (which I imagine is the case in your application).