Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Iterator equivalent to nested for loops shows 50 % performance breakdown - why?

What I try to achieve/figure out

I try to figure out the best way to write a general container (vectors, matrices, higher-dimensional objects) for an arbitrary number of dimensions. The number of dimensions as well as the number of elements per dimension shall be specified at compile time like this:

  • 3 gives a vector with 3 elements
  • 10, 10 gives a matrix with 100 elements
  • 7, 5, 3 gives a second-order tensor with 105 elements
  • ...

One should be able to loop over all elements in an easy way for simple manipulations: multiply all (double) elements by a scalar double, add two compatible containers element-wise etc. Additionally, one should be able to loop over all elements knowing their respective index per dimension, e.g. from (0, 0, 0) to (6, 4, 2) for a tensor.

My attempt: Recursive iterator nesting with variadic templates

I thought that variadic template parameters are a nice tool for recursively connecting iterators for each dimension.

template<
  int N, // the number of elements to iterate for this dimension
  int... otherN // the other dimension's lengths
> class dimIterator;

For storing the pointer to the first element, I had a boxIterator in mind which is just a wrapper for the dimIterators

template<
  typename dataType,
  int... allN
> class boxIterator : public dimIterator<allN...> { // tried as member or inheritance
  protected:
    dataType* data;

  public:
    boxIterator(dataType* data) :
      dimIterator<allN...>(),
      data(data)
    {
      //cout << "constructing box iterator." << endl;
    }

    int getIndex() const {
      int result = dimIterator<allN...>::getIndex();
      return result;
    }

    dataType* getPtr() {
      dataType* result = data + this->getIndex();
      return result;
    }
    const dataType* getPtr() const {
      dataType* result = data + this->getIndex();
      return result;
    }

    bool isDone() const {return dimIterator<allN...>::isDone();}

    boxIterator<dataType, allN...>& operator ++() {
      dimIterator<allN...>::operator++();
      return *this;
    }

    dataType& operator *() {return *(this->getPtr());}
    const dataType& operator *() const {return *(this->getPtr());}
};

template<
  int N, // the number of elements to iterate for this dimension
  int... otherN // the other dimension's lengths
> class dimIterator : public dimIterator<otherN...> { // tried as member or inheritance
  protected:
    int i;

  public:
    dimIterator() :
      dimIterator<otherN...>(),
      i(0)
    {
      //cout << "constructing level with " << N << " gridpoints." << endl;
    }

    int getIndex() const {
      int result = i + N*dimIterator<otherN...>::getIndex();
      return result;
    }

    bool isDone() const {return dimIterator<otherN...>::isDone();}

    dimIterator<N, otherN...>& operator ++() {
      if(i<N-1) {
        ++i;
      }
      else {
        i = 0;
        dimIterator<otherN...>::operator++();
      }
      return *this;
    }
};

template<int N> // stop recursion if only one dimension left
class dimIterator<N> {
  protected:
    int i;

  public:
    dimIterator() :
      i(0)
    {
      //cout << "constructing last level with " << N << " gridpoints." << endl;
    }

    int getIndex() const {
      int result = i;
      return result;
    }

    bool isDone() const {return ( i>= N );}

    dimIterator<N>& operator ++() {
      ++i;
      return *this;
    }
};

Initially, I was quite happy with that approach because it allowed to write the same high-level iterator loop for an arbitrary number of dimensions. The indices for each dimension and similar things like the pitch for a given box in a certain dimension could be obtained easily.

The problem with that approach

However, eventhough I tried to use templates and inline functions for the iterator logic, the compiler does not optimize the stuff to something which is as fast as executing nested for loops. I had a test with an empty multiplication of uninitialized doubles and repeatable get

  • 2.5 s with nested native for loops
  • 5 s with high-level iterator access

My questions

  • What keeps the compiler from treating the method as an equivalent of nested for loops?
  • What would be a general, high-performance approach to this kind of problem? I guess there must be a light-weight library.

Full code and compile info

Compiled with g++ -O3 -std=c++11. Version is g++ (GCC) 4.8.1.

Full code (partially repeated from above):

#include <iostream>

using namespace std;

template<int first, int... other>
class integerPack {
  public:
    constexpr static int length = 1 + integerPack<other...>::length;
    constexpr static int product = first*integerPack<other...>::product;
};

template<int only>
class integerPack<only> {
  public:
    constexpr static int length = 1;
    constexpr static int product = only;
};

template<
  int N, // the number of elements to iterate for this dimension
  int... otherN // the other dimension's lengths
> class dimIterator;

template<
  typename dataType,
  int... allN
> class boxIterator : public dimIterator<allN...> { // tried as member or inheritance
  protected:
    dataType* data;

  public:
    boxIterator(dataType* data) :
      dimIterator<allN...>(),
      data(data)
    {
      //cout << "constructing box iterator." << endl;
    }

    int getIndex() const {
      int result = dimIterator<allN...>::getIndex();
      return result;
    }

    dataType* getPtr() {
      dataType* result = data + this->getIndex();
      return result;
    }
    const dataType* getPtr() const {
      dataType* result = data + this->getIndex();
      return result;
    }

    bool isDone() const {return dimIterator<allN...>::isDone();}

    boxIterator<dataType, allN...>& operator ++() {
      dimIterator<allN...>::operator++();
      return *this;
    }

    dataType& operator *() {return *(this->getPtr());}
    const dataType& operator *() const {return *(this->getPtr());}
};

template<
  int N, // the number of elements to iterate for this dimension
  int... otherN // the other dimension's lengths
> class dimIterator : public dimIterator<otherN...> { // tried as member or inheritance
  protected:
    int i;

  public:
    dimIterator() :
      dimIterator<otherN...>(),
      i(0)
    {
      //cout << "constructing level with " << N << " gridpoints." << endl;
    }

    int getIndex() const {
      int result = i + N*dimIterator<otherN...>::getIndex();
      return result;
    }

    bool isDone() const {return dimIterator<otherN...>::isDone();}

    dimIterator<N, otherN...>& operator ++() {
      if(i<N-1) {
        ++i;
      }
      else {
        i = 0;
        dimIterator<otherN...>::operator++();
      }
      return *this;
    }
};

template<int N> // stop recursion if only one dimension left
class dimIterator<N> {
  protected:
    int i;

  public:
    dimIterator() :
      i(0)
    {
      //cout << "constructing last level with " << N << " gridpoints." << endl;
    }

    int getIndex() const {
      int result = i;
      return result;
    }

    bool isDone() const {return ( i>= N );}

    dimIterator<N>& operator ++() {
      ++i;
      return *this;
    }
};

template<
  int... allN
> class box {
  public:
    constexpr static int dimension = integerPack<allN...>::length;
    constexpr static int NN = integerPack<allN...>::product;

    template<typename dataType>
    using iterator = boxIterator<dataType, allN...>;
};

template<typename dataType, typename boxType>
class boxQuantity {
  public:
    typedef typename boxType::template iterator<dataType> iterator;
    constexpr static int dimension = boxType::dimension;
    constexpr static int NN = boxType::NN;

    typedef boxQuantity<dataType, boxType> thisClass;

  protected:
    boxType mybox;
    dataType* data;
    iterator myit;

  public:
    boxQuantity(
      const boxType& mybox
    ) :
      mybox(mybox),
      data(new dataType[NN]),
      myit(data)
    {
      cout << "I am a quantity of dimension " << dimension
        << " with " << NN << " gridpoints." << endl;
    }

    virtual ~boxQuantity() {
      delete[] data;
    }

    iterator begin() const {
      iterator it(data);
      return it;
    }

    dataType& operator [] (int i) {return data[i];}
    const dataType& operator [] (int i) const {return data[i];}

    // iterator syntax with recursive for-like logic: 5 s
    virtual thisClass& operator *= (const thisClass& rhs) {
      thisClass& lhs = *this;
      for(iterator it=lhs.begin(); !it.isDone(); ++it) {
        lhs[myit.getIndex()] *= rhs[myit.getIndex()];
      }
      return *this;
    }

    // plain nested native for loops: 2.5 s
    /*virtual thisClass& operator *= (const thisClass& rhs) {
      thisClass& lhs = *this;
      for(int yindex=0; yindex<1000; ++yindex) {
        for(int xindex=0; xindex<1000; ++xindex) {
          lhs[yindex*1000 + xindex] *= rhs[yindex*1000 + xindex];
        }
      }
      return *this;
    }*/
};

typedef box<1000, 1000> boxType;
typedef boxQuantity<double, boxType> qType;

int main() {
  boxType qBox;

  qType q1(qBox);
  qType q2(qBox);

  for(int i=0; i<2000; ++i) {
    q1 *= q2;
  }

  return 0;
}

Improvement attempts after reading comments/answers

@cdmh concerning virtual keyword

Thanks for your answer. The code was assembled by taking pieces of a bigger program where the virtual is not that obviously useless as in my example. I expected that the virtual mainly influences the time for calling the function/operator, but not for executing it. As I know that the main part of the execution time is spent inside the loops, I forgot to remove the virtual. However, removing the virtual does not cause a significant performance benefit in my case? (I tested it after your advice.)

@Yakk concerning linearly iterate over a contiguous block of memory

What I probably missed to point out is that I would like to iterate over all elements in a contiguous block of memory and to be able to get the n-dimensional indices for each element. Furthermore, I would like to be able to access the neighbour elements in the direction of a given dimension. Maybe it was a bad idea to remove those functions like e.g. (inside dimIterator)

    template<int dindex>
    int getCoord() const {
      if(1 == dindex) {
        return i;
      }
      return otherDims.template getCoord<dindex-1>();
    }

My previous idea was to

  1. have a base class for all general operations where one does not need to know the respective indices and then
  2. to provide specialized classes for each dimension (by CRTP abuse for reusing the general code?).

However, I would prefer a general solution like above. I can not imagine that it has to be that much slower because I do not see the effective difference compared to nested loops.

like image 487
Julius Avatar asked Nov 03 '22 17:11

Julius


1 Answers

The boxQuantity's virtual operator*=() will not be inlined, which prevents a lot of optimizations. In your example you don't derive from this class. Removing this virtual operator enables the code to be inlined and the SIMD instructions to be used:

00861330  movsd       xmm0,mmword ptr [edx+eax*8]  
00861335  mulsd       xmm0,mmword ptr [edi+eax*8]  
0086133A  movsd       mmword ptr [edi+eax*8],xmm0  
0086133F  dec         ecx  
00861340  jne         main+90h (0861330h)  
like image 51
cdmh Avatar answered Nov 15 '22 05:11

cdmh