This question is rather vague and I don't really need an answer to it, but I am very curious about what the answer might be, so I'll ask it anyway.
I have an algorithm which generates a huge number of matrices. A second algorithm then runs over them and generates a solution. I ran it 100 times and it took an average of ~17 seconds.
The second variant does nearly exactly the same thing; the only difference is that the check algorithm is run over each matrix as soon as it is generated, so the matrices never need to be stored anywhere. This variant obviously needs much less space, which is why I made it, but it also needs an average of only ~2 seconds for the same problem.
I didn't expect it to run faster, and certainly not by that much.
The code is quite big, so I will try to outline the difference in something resembling pseudo-code:
void recursiveFill(std::vector<Matrix> &cache, Matrix permutation) {
    while (!stopCondition) {
        // generate next matrix from current permutation
        if (success)
            cache.push_back(permutation);
        else
            recursiveFill(cache, permutation);
        // some more code
    }
}

void recursiveCheck(Matrix permutation) {
    while (!stopCondition) {
        // alter the matrix some
        if (success)
            checkAlgorithm(permutation);
        else
            recursiveCheck(permutation);
        // some more code
    }
}
After the recursive fill, a loop runs checkAlgorithm over all elements in the cache. Everything I didn't include in the code is identical in both algorithms. I guessed that storing into the vector is what eats up all the time, but if I recall correctly, the capacity of a C++ vector grows geometrically (doubling in many implementations) each time it fills up, so reallocations shouldn't happen too often. Any ideas?
The culprit here is probably temporal locality. Your CPU cache is only so big, so when you save everything off after each run and come back to it later, the data has left your CPU caches in the meantime and takes longer (tens to hundreds of cycles) to access. With the second method, it's right there in L1 (or even in registers) and takes only a cycle or two to access.
In optimization, you generally want to think like the Wu-Tang Clan: Cache Rules Everything Around Me.
Some people have done testing on this, and copies that stay in cache are often much cheaper than dereferences into main memory.