 

Why is it faster to calculate the product of a consecutive array of integers by performing the calculation in pairs?

I was trying to create my own factorial function when I found that the calculation is twice as fast if it is done in pairs. Like this:

Groups of 1: 2*3*4 ... 50000*50001 = 4.1 seconds

Groups of 2: (2*3)*(4*5)*(6*7) ... (50000*50001) = 2.0 seconds

Groups of 3: (2*3*4)*(5*6*7) ... (49999*50000*50001) = 4.8 seconds

Here is the C# I used to test this.

using System;
using System.Diagnostics;
using System.Numerics;

Stopwatch timer = new Stopwatch();
timer.Start();

// Separate the calculation into groups of this size.
int k = 2;

BigInteger total = 1;

// 'i' steps from 2 towards 50001 in increments of 'k'; the inner loop
// multiplies the factors 'i' to 'i + k - 1' into 'partialTotal', and
// 'total' is then multiplied by that result.
for (var i = 2; i < 50000 + 2; i += k)
{
    BigInteger partialTotal = 1;
    for (var j = 0; j < k; j++)
    {
        // Stops once the factor reaches 50000.
        if (i + j >= 50000) break;
        partialTotal *= i + j;
    }
    total *= partialTotal;
}

Console.WriteLine(timer.ElapsedMilliseconds / 1000.0 + "s");

I tested this with different group sizes and put the average times over a few runs in a bar graph. I expected the calculation to become more efficient as I increased the group size, but groups of 3 were the least efficient and groups of 4 were no better than groups of 1.

Bar plot showing the calculation time for different group sizes.

Link to First Data

Link to Second Data

What causes this difference, and is there an optimal way to calculate this?

asked Aug 22 '16 by Will Hamic

2 Answers

BigInteger has a fast case for numbers of 31 bits or less. When you do a pairwise multiplication, this means a specific fast path is taken that multiplies the values into a single ulong and sets the result directly:

public void Mul(ref BigIntegerBuilder reg1, ref BigIntegerBuilder reg2) {
  ...
  if (reg1._iuLast == 0) {
    if (reg2._iuLast == 0)
      Set((ulong)reg1._uSmall * reg2._uSmall);
    else {
      ...
    }
  }
  else if (reg2._iuLast == 0) {
    ...
  }
  else {
    ...
  }
}
public void Set(ulong uu) {
  uint uHi = NumericsHelpers.GetHi(uu);
  if (uHi == 0) {
    _uSmall = NumericsHelpers.GetLo(uu);
    _iuLast = 0;
  }
  else {
    SetSizeLazy(2);
    _rgu[0] = (uint)uu;
    _rgu[1] = uHi;
  }
  AssertValid(true);
}

A 100% predictable branch like this is perfect for a JIT, and this fast path should get optimized extremely well. It's possible that the writes to _rgu[0] and _rgu[1] are even inlined. This is extremely cheap, so it effectively cuts the number of real operations in half.
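You can make the gap between the two paths visible directly. A minimal benchmark sketch (my own illustration, not part of the original answer; the operand sizes and iteration count are arbitrary, and absolute timings will vary by machine and runtime):

using System;
using System.Diagnostics;
using System.Numerics;

class FastPathDemo
{
    static void Main()
    {
        const int iterations = 10_000_000;
        BigInteger a = 40000, b = 50000;            // both fit in 31 bits -> fast path
        BigInteger big = BigInteger.Pow(10, 1000);  // thousands of bits -> many 32-bit limbs

        var sw = Stopwatch.StartNew();
        BigInteger r1 = BigInteger.Zero;
        for (int i = 0; i < iterations; i++)
            r1 = a * b;                             // single ulong multiply, as in Set() above
        Console.WriteLine($"small * small: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        BigInteger r2 = BigInteger.Zero;
        for (int i = 0; i < iterations; i++)
            r2 = big * b;                           // a carry pass over every limb of 'big'
        Console.WriteLine($"big * small:   {sw.ElapsedMilliseconds} ms");

        Console.WriteLine((r1 + r2).Sign);          // keep the results alive
    }
}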

So why is a group of three so much slower? It's obvious that it should be slower than for k = 2; you have far fewer optimized multiplications. More interesting is why it's slower than k = 1. This is easily explained by the fact that the outer multiplication of total now hits the slow path. For k = 2 this impact is mitigated by halving the number of multiplies and the potential inlining of the array.

However, these factors do not help k = 3, and in fact the slow case hurts k = 3 a lot more. The second multiplication in the k = 3 case hits this branch:

  if (reg1._iuLast == 0) {
    ...
  }
  else if (reg2._iuLast == 0) {
    Load(ref reg1, 1);
    Mul(reg2._uSmall);
  }
  else {
    ...
  }

which allocates

  EnsureWritable(1);

  uint uCarry = 0;
  for (int iu = 0; iu <= _iuLast; iu++)
    uCarry = MulCarry(ref _rgu[iu], u, uCarry);

  if (uCarry != 0) {
    SetSizeKeep(_iuLast + 2, 0);
    _rgu[_iuLast] = uCarry;
  }

Why does this matter? Well, EnsureWritable(1) causes

uint[] rgu = new uint[_iuLast + 1 + cuExtra];

so rgu, the new backing array for partialTotal, becomes length 3 even though the value itself needs only two uints. The number of passes in total's multiplication is decided in

public void Mul(ref BigIntegerBuilder reg1, ref BigIntegerBuilder reg2)

as

    for (int iu1 = 0; iu1 < cu1; iu1++) {
      ...
      for (int iu2 = 0; iu2 < cu2; iu2++, iuRes++)
        uCarry = AddMulCarry(ref _rgu[iuRes], uCur, rgu2[iu2], uCarry);
      ...
    }

which means that we have a total of len(total._rgu) * 3 operations. This hasn't saved us anything! There are only len(total._rgu) * 1 passes for k = 1 - we just do it three times!
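For a sense of scale (a back-of-the-envelope estimate, not from the answer): log2(50001!) = Σ log2 i ≈ 7 × 10^5 bits, so by the end of the run total occupies on the order of 20,000 32-bit limbs, and every extra pass over it is that many carry-multiply steps.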

There is actually an optimization on the outer loop that reduces this back down to len(total._rgu) * 2:

      uint uCur = rgu1[iu1];
      if (uCur == 0)
        continue;

However, they "optimize" this optimization in a way that hurts far more than before:

if (reg1.CuNonZero <= reg2.CuNonZero) {
  rgu1 = reg1._rgu; cu1 = reg1._iuLast + 1;
  rgu2 = reg2._rgu; cu2 = reg2._iuLast + 1;
}
else {
  rgu1 = reg2._rgu; cu1 = reg2._iuLast + 1;
  rgu2 = reg1._rgu; cu2 = reg1._iuLast + 1;
}

For k = 2, that causes the outer loop to be over total, since reg2 contains no zero values with high probability. This is great because total is way longer than partialTotal, so the fewer passes the better. For k = 3, the EnsureWritable(1) will always leave a spare zero slot, because the product of three numbers of no more than 16 bits each (50001 < 2^16) is at most 48 bits and so never needs a third uint. This means that, although we still only do one pass over total for k = 2, we do two for k = 3!

This starts to explain why the speed increases again beyond k = 3: the number of passes per outer multiplication grows more slowly than the number of outer multiplications shrinks, since each extra factor adds only ~16 bits to the inner value. The inner multiplications are fast relative to the massive total multiplications, so the more time spent consolidating values, the more time saved in passes over total. Furthermore, the "optimization" is less often a pessimization.

It also explains why odd values of k take longer: they add an extra 32-bit integer to the _rgu array. This wouldn't happen so cleanly if the ~16 bits per factor weren't so close to half of 32.


It's worth noting that there are a lot of ways to improve this code; the comments here are about why, not how to fix it. The easiest improvement would be to chuck the values in a heap and multiply only the two smallest values at a time.
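As a rough sketch of that heap idea (my own code, not from the answer; it assumes .NET 6+ for PriorityQueue):

using System.Numerics;

static BigInteger Factorial(int n)
{
    // Min-heap keyed on the values themselves: always multiply the two
    // smallest remaining values, so the operands stay balanced in size.
    var heap = new PriorityQueue<BigInteger, BigInteger>();
    for (int i = 2; i <= n; i++)
        heap.Enqueue(i, i);

    while (heap.Count > 1)
    {
        BigInteger a = heap.Dequeue();
        BigInteger b = heap.Dequeue();
        BigInteger product = a * b;
        heap.Enqueue(product, product);
    }

    return heap.Count == 1 ? heap.Dequeue() : BigInteger.One;
}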

answered Oct 14 '22 by Veedrac


The time required to do a BigInteger multiplication depends on the size of the product.

Both methods take the same number of multiplications, but if you multiply the factors in pairs, then the average size of the product is much smaller than it is if you multiply each factor with the product of all the smaller ones.
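Taken to its logical conclusion, that pairing becomes a balanced product tree. A sketch of the idea (my own code, not from the answer):

using System.Numerics;

// Product of the integers lo..hi inclusive: split the range in half so
// that both operands of every multiplication have comparable size.
static BigInteger Product(int lo, int hi)
{
    if (lo > hi) return BigInteger.One;
    if (lo == hi) return lo;
    int mid = lo + (hi - lo) / 2;
    return Product(lo, mid) * Product(mid + 1, hi);
}

// Example: Product(2, 50001) computes 50001!.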

You can do even better if you always multiply the two smallest factors (original factors or intermediate products) that have yet to be multiplied together, until you get to the complete product.

answered Oct 14 '22 by Matt Timmermans