For C++ Vector3 utility class implementations, is array faster than struct and class?

just out of curiosity, I implemented vector3 utilities in 3 ways: array (with a typedef), class and struct

This is the array implementation:

typedef float newVector3[3];

namespace vec3{
    void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
    void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
    void dot(const newVector3& first, const newVector3& second, float& out_result);
    void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3);

    // implementations, nothing fancy...really

    void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[0] + second[0];
        out_newVector3[1] = first[1] + second[1];
        out_newVector3[2] = first[2] + second[2];
    }

    void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[0] - second[0];
        out_newVector3[1] = first[1] - second[1];
        out_newVector3[2] = first[2] - second[2];
    }

    void dot(const newVector3& first, const newVector3& second, float& out_result){
        out_result = first[0]*second[0] + first[1]*second[1] + first[2]*second[2];
    }

    void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[1]*second[2] - first[2]*second[1];
        out_newVector3[1] = first[2]*second[0] - first[0]*second[2];
        out_newVector3[2] = first[0]*second[1] - first[1]*second[0];
    }
}

And a class implementation:

class Vector3{
private:
    float x;
    float y;
    float z;

public:
    // constructors
    Vector3(float new_x, float new_y, float new_z){
        x = new_x;
        y = new_y;
        z = new_z;
    }

    Vector3(const Vector3& other){
        if(&other != this){
            this->x = other.x;
            this->y = other.y;
            this->z = other.z;
        }
    }
};

Of course, it contains the other functionality that usually appears in a Vector3 class.

And finally, a struct implementation:

struct s_vector3{
    float x;
    float y;
    float z;

    // constructors
    s_vector3(float new_x, float new_y, float new_z){
        x = new_x;
        y = new_y;
        z = new_z;
    }

    s_vector3(const s_vector3& other){
        if(&other != this){
            this->x = other.x;
            this->y = other.y;
            this->z = other.z;
        }
    }
};

Again, I omitted some other common Vector3 functionality. Now, I let all three of them create 9,000,000 new objects and perform 9,000,000 cross products (after each one finishes, I write a huge chunk of data to the cache, to keep cached results from helping the next one).

Here is the test code:

#include <ctime>
#include <cstdio>
#include <cstdlib>

const int K_OPERATION_TIME = 9000000;
const size_t bigger_than_cachesize = 20 * 1024 * 1024;

void cleanCache()
{
    // flush the cache by writing a buffer much larger than any cache level
    long *p = new long[bigger_than_cachesize];
    for(size_t i = 0; i < bigger_than_cachesize; i++)
    {
        p[i] = rand();
    }
    delete[] p;
}

int main(){

    cleanCache();
    // first, the Vector3 struct
    std::clock_t start;
    double duration;

    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
        s_vector3 newVector3Struct = s_vector3(i,i,i);
        newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The struct implementation of Vector3 takes %f seconds.\n", duration);

    cleanCache();
    // second, the Vector3 array implementation
    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
        newVector3 newVector3Array = {float(i), float(i), float(i)};
        newVector3 opResult;
        vec3::cross(newVector3Array, newVector3Array, opResult);
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The array implementation of Vector3 takes %f seconds.\n", duration);

    cleanCache();
    // Third, the Vector3 class implementation
    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
        Vector3 newVector3Class = Vector3(i,i,i);
        newVector3Class = Vector3::cross(newVector3Class, newVector3Class);
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The class implementation of Vector3 takes %f seconds.\n", duration);


    return 0;
}

The result is astonishing.

The struct and class implementations finish the task in around 0.23 seconds, whereas the array implementation takes only 0.08 seconds!

If arrays really do have a significant performance advantage like this, then despite the uglier syntax, they would be worth using in a lot of cases.

So I really want to make sure: is this what is supposed to happen? Thanks!

Wenyu asked Sep 26 '17

2 Answers

Short answer: it depends. As you observed, there is a difference if the code is compiled without optimization.

When I compile with optimization on (-O2 or -O3), all functions are inlined and there is no difference (read on to see that it is not as simple as it seems).

 Optimization    Times (struct vs. array)
    -O0              0.27 vs. 0.12
    -O1              0.14 vs. 0.04
    -O2              0.00 vs. 0.00
    -O3              0.00 vs. 0.00

There is no guarantee what optimizations your compiler can or will do, so the complete answer is "it depends on your compiler". At first I would trust my compiler to do The Right Thing; otherwise I would have to start programming in assembly. Only if this part of the code is a real bottleneck is it worth thinking about helping the compiler.

If compiled with -O2, your code takes exactly 0.0 seconds for both versions, but this is because the optimizer sees that those values are not used at all, so it just throws the whole code away!

Let's ensure that this doesn't happen:

#include <ctime>
#include <cstdio>

const int K_OPERATION_TIME = 1000000000;

int main(){
    std::clock_t start;
    double duration;

    start = std::clock();

    double checksum=0.0;
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        s_vector3 newVector3Struct = s_vector3(i,i,i);
        newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
        checksum+=newVector3Struct.x +newVector3Struct.y+newVector3Struct.z; // actually using the result of cross-product!
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The struct implementation of Vector3 takes %f seconds.\n", duration);

    // second, the Vector3 array implementation
    start = std::clock();

    for(int i = 0; i < K_OPERATION_TIME; ++i){
        newVector3 newVector3Array = {float(i), float(i), float(i)};
        newVector3 opResult;
        vec3::cross(newVector3Array, newVector3Array, opResult);
        checksum+=opResult[0] +opResult[1]+opResult[2];  // actually using the result of cross-product!
    }

    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The array implementation of Vector3 takes %f seconds.\n", duration);

    printf("Checksum: %f\n", checksum);
}

You will see the following changes:

  1. The cache is no longer involved (there are no cache misses), so I just deleted the code responsible for flushing it.
  2. There is no performance difference between class and struct (after compilation there is really no difference; the whole public/private syntactic-sugar distinction is only skin-deep), so I look only at the struct.
  3. The result of the cross product is actually used and cannot be optimized away.
  4. There are now 1e9 iterations, to get meaningful times.

With this change we see the following timings (intel compiler):

 Optimization    Times (struct vs. array)
    -O0              33.2 vs. 17.1
    -O1              19.1 vs. 7.8
    -Os              19.2 vs. 7.9
    -O2              0.7 vs. 0.7
    -O3              0.7 vs. 0.7

I'm a little bit disappointed that -Os performs so badly, but otherwise you can see that, once optimized, there is no difference between structs and arrays!


Personally I like -Os a lot, because it produces assembly I'm able to understand, so let's take a look at why it is so slow.

The most obvious thing, even without looking into the resulting assembly: s_vector3::cross returns an s_vector3 object, but we assign the result to an already existing object, so if the optimizer does not see that the old object is no longer used, it might not be able to do RVO (return value optimization). So let's replace

newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
checksum+=newVector3Struct.x +newVector3Struct.y+newVector3Struct.z;

with:

s_vector3 r = s_vector3::cross(newVector3Struct, newVector3Struct);
checksum+=r.x +r.y+r.z; 

The result is now 2.14 seconds (struct) vs. 7.9 seconds (array) - quite an improvement!

My take-away: the optimizer does a great job, but we can help it a little if needed.

ead answered Nov 11 '22


In this case, no. As far as the CPU is concerned, classes, structs, and arrays are simply memory layouts, and the layout in this case is identical. In non-release builds, if you are using inline methods, they may be compiled into actual function calls (primarily to help a debugger step into the methods), so that may have a small impact.

Addition isn't really a good way to test the performance of a Vec3 type; dot and/or cross products usually make a better test.

If you really cared about performance, you'd want to take a structure-of-arrays approach (instead of the array-of-structures approach you have above). This tends to allow the compiler to apply auto-vectorization.

i.e. instead of this:

constexpr int N = 100000;
struct Vec3 {
  float x, y, z; 
};
inline float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
void dotLots(float* dps, const Vec3 a[N], const Vec3 b[N])
{
  for(int i = 0; i < N; ++i)
    dps[i] = dot(a[i], b[i]);
}

You'd do this:

constexpr int N = 100000;
struct Vec3SOA {
  float x[N], y[N], z[N]; 
};
void dotLotsSOA(float* dps, const Vec3SOA& a, const Vec3SOA& b)
{
  for(int i = 0; i < N; ++i)
  {
    dps[i] = a.x[i]*b.x[i] + a.y[i]*b.y[i] + a.z[i]*b.z[i];
  }
}

If you compile with -mavx2 and -mfma, the latter version optimizes quite nicely.

robthebloke answered Nov 11 '22