I'm trying to solve an exercise from the Rosalind project, but apparently I keep making some mistake. The full text is available here, but my shorter, abstract interpretation and attempt are as follows. Please help me find what I am doing wrong:
We have 3 groups of items: AA, Aa, aa. We start with 1 in Aa and do k iterations of generating new items. In every iteration every item in group:
Aa could produce: AA (25%), Aa (50%), aa (25%)
AA could produce: AA (50%), Aa (50%)
aa could produce: aa (50%), Aa (50%)

As a result of each iteration we count the expected number of items in each group, assuming we generate 2 new items from each one in the previous iteration. So we end up with:
AA: 0, Aa: 1, aa: 0
AA: .5, Aa: 1, aa: .5
AA: 1, Aa: 2, aa: 1

The sum of expected values (the population) at each iteration is 2^iteration, and the probability of an item being in group Aa is always 50%.
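The bookkeeping above can be sketched in a few lines of Python. This is only an illustration of the expected-value table, using the percentages listed earlier; the names `OFFSPRING` and `expected_counts` are made up for this sketch and do not come from the exercise.

```python
# Offspring probabilities per parent group, taken from the percentages above.
OFFSPRING = {
    "AA": {"AA": 0.5, "Aa": 0.5},
    "Aa": {"AA": 0.25, "Aa": 0.5, "aa": 0.25},
    "aa": {"Aa": 0.5, "aa": 0.5},
}


def expected_counts(k, children=2):
    """Expected number of items per group after k iterations.

    Assumes we start with a single Aa item and that every item
    produces `children` new items per iteration (hypothetical helper).
    """
    counts = {"AA": 0.0, "Aa": 1.0, "aa": 0.0}
    for _ in range(k):
        nxt = {"AA": 0.0, "Aa": 0.0, "aa": 0.0}
        for parent, n in counts.items():
            for child, p in OFFSPRING[parent].items():
                nxt[child] += n * children * p
        counts = nxt
    return counts


for k in range(3):
    counts = expected_counts(k)
    total = sum(counts.values())
    # Print the expected counts and the share of the population in Aa.
    print(k, counts, counts["Aa"] / total)
```

For k = 0, 1, 2 this reproduces the three rows listed above, with a total of 2^k and an Aa share of 0.5 each time.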
So far I hope I'm right, but what we're actually after is: what are the chances of having at least N items that are in group Aa both times if we repeat the experiment twice? (This should be equivalent to: what are the chances of having at least N items in group AaBb if we extend the list of groups to AABB, AABb, ... from the original question.)
So the probability of an item being in Aa is 50%, the population is the sum of expected values from the iteration (or 2^iteration), and throwing that at scipy using the test data (k=2, N=1), we get this for at least one item in group Aa:
In [75]: b = scipy.stats.binom(4, .5)
In [76]: sum(b.pmf(x) for x in range(1, 4+1))
Out[76]: 0.93750000000000022
and this for at least one item if we have two sets of groups, so AaBb:
In [77]: sum(b.pmf(x) for x in range(1, 4+1))**2
Out[77]: 0.87890625000000044
Which is completely different from the answer in the original question: 0.684
Where did I make a mistake? (if possible please only point out the mistake, rather than give a solution, so that there are no spoilers left for people trying to solve it on their own)
I first followed your example and thought it seemed to make sense, but after a while I found where the problem was.
Here is a pointer to your mistake:
You have calculated the probability of getting at least one Aa-- in the second generation and at least one --Bb. But this is not enough to find out if there is at least one AaBb in the second generation: the Aa-- and --Bb have to coincide in the same individual.
Consider for example the following second generation: aaBb, AABb, Aabb, AaBB. All individuals have either Aa-- or --Bb, but there are no AaBb in the generation.
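The counterexample can be checked mechanically. The snippet below is only a small sketch of that check; the helper names `has_Aa` and `has_Bb` are invented here, and it deliberately does not compute the answer to the exercise.

```python
# The second generation from the counterexample above.
generation = ["aaBb", "AABb", "Aabb", "AaBB"]


def has_Aa(genotype):
    """True if the first gene pair is heterozygous (Aa--)."""
    return genotype[:2] == "Aa"


def has_Bb(genotype):
    """True if the second gene pair is heterozygous (--Bb)."""
    return genotype[2:] == "Bb"


any_Aa = any(has_Aa(g) for g in generation)
any_Bb = any(has_Bb(g) for g in generation)
any_AaBb = any(has_Aa(g) and has_Bb(g) for g in generation)

# Both single-gene events occur, yet no individual is AaBb.
print(any_Aa, any_Bb, any_AaBb)  # True True False
```

So "at least one Aa-- and at least one --Bb" can hold while "at least one AaBb" fails, which is why squaring the single-gene probability overestimates the result.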