Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing calculation of frequencies of gametes in populations

I need to optimize calculation of the frequencies of gametes in populations.

I have np populations and Ne individuals in each population. Each individual is formed by two gametes (male and female). Each gamete contains three genes. Each gen may be 0 or 1. So each individual is a 2x3 matrix. Each row of the matrix is a gamete given by one of the parents. The set of individuals in each population may be arbitrary (but always of Ne length). For simplicity initial populations with individuals may be given as:

Ne = 300; np = 3^7;
(*This table may be arbitrary with the same shape*)
ind = Table[{{0, 0, 0}, {1, 1, 1}}, {np}, {Ne}]

Full set of all possible gametes:

allGam = Tuples[{0, 1}, 3]

Each individual can generate a gamete by 8 possible ways with equal probability. These gametes are: Tuples@Transpose@ind[[iPop, iInd]] (where iPop and iInd - indexes of population and of individual in that population). I need to calculate the frequencies of gametes generated by individuals for each population.

At this moment my solution is as follows.

At first, I convert each individual into gametes it can produce:

gamsInPop = Map[Sequence @@ Tuples@Transpose@# &, ind, {2}]

But more efficient way to do this is:

gamsInPop = 
 Table[Join @@ Table[Tuples@Transpose@ind[[i, j]], {j, 1, Ne}], {i, 1, np}]

Secondly, I calculate the frequencies of gametes produced including zero frequencies for gametes that are possible but absent in population:

gamFrq = Table[Count[pop, gam]/(8 Ne), {pop, gamInPop}, {gam, allGam}]

More efficient version of this code:

gamFrq = Total[
   Developer`ToPackedArray[
    gamInPop /. Table[
      allGam[[i]] -> Insert[{0, 0, 0, 0, 0, 0, 0}, 1, i], {i, 1, 
       8}]], {2}]/(8 Ne)

Unfortunately, the code is still too slow. Can anybody help me to speed-up it?

like image 882
Alexey Popkov Avatar asked Aug 25 '11 08:08

Alexey Popkov


1 Answers

This code:

Clear[getFrequencies];
Module[{t = 
   Developer`ToPackedArray[
     Table[FromDigits[#, 2] & /@ 
         Tuples[Transpose[{
            PadLeft[IntegerDigits[i, 2], 3], 
            PadLeft[IntegerDigits[j, 2], 3]}]], 
       {i, 0, 7}, {j, 0, 7}]
    ]},
   getFrequencies[ind_] :=
    With[{extracted = 
       Partition[
          Flatten@Extract[t, Flatten[ind.(2^Range[0, 2]) + 1, 1]], 
          Ne*8]},
        Map[
         Sort@Join[#, Thread[{Complement[Range[0, 7], #[[All, 1]]], 0}]] &@Tally[#] &, 
         extracted
        ][[All, All, 2]]/(Ne*8)
    ]
]

utilizes conversion to decimal numbers and packed arrays, and speeds your code up by a factor of 40 on my machine. The benchmarks:

In[372]:= Ne=300;np=3^7;
(*This table may be arbitrary with the same shape*)
inds=Table[{{0,0,0},{1,1,1}},{np},{Ne}];

In[374]:= 
getFrequencies[inds]//Short//Timing
Out[374]= {0.282,{{1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8},<<2185>>,
{1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8}}}

In[375]:= 
Timing[
  gamsInPop=Table[Join@@Table[Tuples@Transpose@inds[[i,j]],{j,1,Ne}],{i,1,np}];
  gamFrq=Total[Developer`ToPackedArray[gamsInPop/.Table[allGam[[i]]->
         Insert[{0,0,0,0,0,0,0},1,i],{i,1,8}]],{2}]/(8 Ne)//Short]

Out[375]= {10.563,{{1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8},<<2185>>,
  {1/8,1/8,1/8,1/8,1/8,1/8,1/8,1/8}}}

Note that in general (for random populations), the ordering of frequencies in your and my solutions are for some reason different, and

In[393]:= fr[[All,{1,5,3,7,2,6,4,8}]] == gamFrq
Out[393]= True

Now, some explanation: first, we create a table t, which is constructed as follows: each gamete is assigned a number from 0 to 7, which corresponds to the zeros and ones in it treated as binary digits. The table then has the possible gametes produced by an individual, stored in a position {i,j}, where i is a decimal for mother's gamete (say), and j - for fathers's, for that individual (each individual is uniquely identified by a pair {i,j}). The gametes produced by individual are also converted to decimals. Here is how it looks:

In[396]:= t//Short[#,5]&
Out[396]//Short= {{{0,0,0,0,0,0,0,0},{0,1,0,1,0,1,0,1},{0,0,2,2,0,0,2,2},
{0,1,2,3,0,1,2,3},{0,0,0,0,4,4,4,4},{0,1,0,1,4,5,4,5},{0,0,2,2,4,4,6,6},
{0,1,2,3,4,5,6,7}},<<6>>,{{7,6,5,4,3,2,1,0},{7,7,5,5,3,3,1,1},{7,6,7,6,3,2,3,2},
<<2>>,{7,7,5,5,7,7,5,5},{7,6,7,6,7,6,7,6},{7,7,7,7,7,7,7,7}}}

A very important (crucial) step is to convert this table to a packed array.

The line Flatten[ind.(2^Range[0, 2]) + 1, 1]] converts parents' gametes from binary to decimal for all individuals at once, in all populations, and adds 1 so that these become indices at which the list of possible to produce gametes is stored in a table t for a given individual. We then Extract all of them at once, for all populations, and use Flatten and Partition to recover back the population structure. Then, we compute frequencies with Tally, append missing gametes with frequencies zero (done by Join[#, Thread[{Complement[Range[0, 7], #[[All, 1]]], 0}]] line), and Sort each frequency list for a fixed population. Finally, we extract the frequencies and discard the gamete decimal index.

All operations are pretty fast since performed on packed arrays. The speedup is due to the vectorized formulation of the problem and use of packed arrays. It is also much more memory - efficient.

like image 142
Leonid Shifrin Avatar answered Oct 08 '22 09:10

Leonid Shifrin