Just out of curiosity, I implemented Vector3 utilities in three ways: an array (with a typedef), a class, and a struct.
This is the array implementation:
typedef float newVector3[3];

namespace vec3{
    void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
    void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3);
    void dot(const newVector3& first, const newVector3& second, float& out_result);
    void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3);

    // implementations, nothing fancy...really
    void add(const newVector3& first, const newVector3& second, newVector3& out_newVector3)
    {
        out_newVector3[0] = first[0] + second[0];
        out_newVector3[1] = first[1] + second[1];
        out_newVector3[2] = first[2] + second[2];
    }
    void subtract(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[0] - second[0];
        out_newVector3[1] = first[1] - second[1];
        out_newVector3[2] = first[2] - second[2];
    }
    void dot(const newVector3& first, const newVector3& second, float& out_result){
        out_result = first[0]*second[0] + first[1]*second[1] + first[2]*second[2];
    }
    void cross(const newVector3& first, const newVector3& second, newVector3& out_newVector3){
        out_newVector3[0] = first[1] * second[2] - first[2] * second[1];
        out_newVector3[1] = first[2] * second[0] - first[0] * second[2];
        out_newVector3[2] = first[0] * second[1] - first[1] * second[0];
    }
}
And a class implementation:
class Vector3{
private:
    float x;
    float y;
    float z;
public:
    // constructors
    Vector3(float new_x, float new_y, float new_z){
        x = new_x;
        y = new_y;
        z = new_z;
    }
    Vector3(const Vector3& other){
        if(&other != this){
            this->x = other.x;
            this->y = other.y;
            this->z = other.z;
        }
    }
};
Of course, it contains other functionality that usually appears in a Vector3 class.
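For reference, here is my guess at what the static cross member that the test code below calls might look like; it would live inside the class body, and the struct version s_vector3::cross is assumed to be analogous (this is an assumption, not the author's actual code):
// Hedged sketch (assumption): a static cross member as called by the benchmark below.
static Vector3 cross(const Vector3& a, const Vector3& b){
    return Vector3(a.y * b.z - a.z * b.y,
                   a.z * b.x - a.x * b.z,
                   a.x * b.y - a.y * b.x);
}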
And finally, a struct implementation:
struct s_vector3{
    float x;
    float y;
    float z;
    // constructors
    s_vector3(float new_x, float new_y, float new_z){
        x = new_x;
        y = new_y;
        z = new_z;
    }
    s_vector3(const s_vector3& other){
        if(&other != this){
            this->x = other.x;
            this->y = other.y;
            this->z = other.z;
        }
    }
};
Again, I omitted some other common Vector3 functionality. Now, I let all three of them create 9,000,000 new objects and perform 9,000,000 cross products (I write a huge chunk of data to the cache after each one finishes, so the cache doesn't help the next one out).
Here is the test code:
#include <ctime>
#include <cstdio>
#include <cstdlib>

const int K_OPERATION_TIME = 9000000;
const size_t bigger_than_cachesize = 20 * 1024 * 1024;

void cleanCache()
{
    // flush the cache
    long *p = new long[bigger_than_cachesize]; // 20M longs, well beyond any cache size
    for(size_t i = 0; i < bigger_than_cachesize; i++)
    {
        p[i] = rand();
    }
    delete[] p;
}

int main(){
    cleanCache();

    // first, the Vector3 struct
    std::clock_t start;
    double duration;
    start = std::clock();
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        s_vector3 newVector3Struct = s_vector3(i, i, i);
        newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
    }
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The struct implementation of Vector3 takes %f seconds.\n", duration);

    cleanCache();

    // second, the Vector3 array implementation
    start = std::clock();
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        newVector3 newVector3Array = {(float)i, (float)i, (float)i};
        newVector3 opResult;
        vec3::cross(newVector3Array, newVector3Array, opResult);
    }
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The array implementation of Vector3 takes %f seconds.\n", duration);

    cleanCache();

    // third, the Vector3 class implementation
    start = std::clock();
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        Vector3 newVector3Class = Vector3(i, i, i);
        newVector3Class = Vector3::cross(newVector3Class, newVector3Class);
    }
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The class implementation of Vector3 takes %f seconds.\n", duration);

    return 0;
}
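One way to build and run the benchmark, assuming g++ (or clang++) and that everything above is pasted into a single file; the name vec3_bench.cpp is just a placeholder:
# no optimization, to reproduce the numbers below
g++ -std=c++11 -O0 vec3_bench.cpp -o vec3_bench
./vec3_bench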
The result is astonishing: the struct and class implementations finish the task in around 0.23 seconds, whereas the array implementation takes only 0.08 seconds!
If arrays really do have a significant performance advantage like this, then even though the syntax is uglier, they would be worth using in a lot of cases. So I really want to make sure: is this what is supposed to happen? Thanks!
Short answer: it depends. As you observed, there is a difference if the code is compiled without optimization. When I compile (all functions inlined) with optimization on (-O2 or -O3), there is no difference (read on to see that it is not as easy as it seems):
Optimization    Times (struct vs. array)
-O0             0.27 vs. 0.12
-O1             0.14 vs. 0.04
-O2             0.00 vs. 0.00
-O3             0.00 vs. 0.00
There is no guarantee as to which optimizations your compiler can or will do, so the complete answer is "it depends on your compiler". My first instinct is to trust my compiler to do The Right Thing; otherwise I should start programming in assembly. Only if this part of the code is a real bottleneck is it worth thinking about helping the compiler.
If compiled with -O2, your code takes exactly 0.0 seconds for both versions, but this is because the optimizer sees that those values are not used at all, so it just throws away the whole code! Let's ensure that this doesn't happen:
#include <ctime>
#include <cstdio>

const int K_OPERATION_TIME = 1000000000;

int main(){
    std::clock_t start;
    double duration;
    double checksum = 0.0;

    // first, the Vector3 struct
    start = std::clock();
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        s_vector3 newVector3Struct = s_vector3(i, i, i);
        newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
        checksum += newVector3Struct.x + newVector3Struct.y + newVector3Struct.z; // actually using the result of the cross product!
    }
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The struct implementation of Vector3 takes %f seconds.\n", duration);

    // second, the Vector3 array implementation
    start = std::clock();
    for(int i = 0; i < K_OPERATION_TIME; ++i){
        newVector3 newVector3Array = {(float)i, (float)i, (float)i};
        newVector3 opResult;
        vec3::cross(newVector3Array, newVector3Array, opResult);
        checksum += opResult[0] + opResult[1] + opResult[2]; // actually using the result of the cross product!
    }
    duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC;
    printf("The array implementation of Vector3 takes %f seconds.\n", duration);

    printf("Checksum: %f\n", checksum);
}
You will see the following changes:
- the result of the cross product is accumulated into a checksum that gets printed, so it cannot be optimized away;
- 1e9 iterations, to get meaningful times.
With these changes we see the following timings (Intel compiler):
Optimization    Times (struct vs. array)
-O0             33.2 vs. 17.1
-O1             19.1 vs.  7.8
-Os             19.2 vs.  7.9
-O2              0.7 vs.  0.7
-O3              0.7 vs.  0.7
I'm a little bit disappointed that -Os performs so badly here, but otherwise you can see that, once optimized, there is no difference between structs and arrays! Personally I like -Os a lot, because it produces assembly I'm able to understand, so let's take a look at why it is so slow.
The most obvious thing, even without looking into the resulting assembly: s_vector3::cross returns an s_vector3 object, but we assign the result to an already existing object, so if the optimizer does not see that the old object is no longer used, it might not be able to do RVO. So let's replace
newVector3Struct = s_vector3::cross(newVector3Struct, newVector3Struct);
checksum += newVector3Struct.x + newVector3Struct.y + newVector3Struct.z;
with:
s_vector3 r = s_vector3::cross(newVector3Struct, newVector3Struct);
checksum += r.x + r.y + r.z;
The results are now 2.14 seconds (struct) vs. 7.9 seconds (array), which is quite an improvement!
My takeaway: the optimizer does a great job, but we can help it a little if needed.
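If you want to look at the generated assembly yourself, one way to do it (assuming g++ or clang++, with bench.cpp as a placeholder file name) is something like:
# emit human-readable assembly for the translation unit (hypothetical file name)
g++ -std=c++11 -Os -S -o bench.s bench.cpp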
In this case, no. As far as the CPU is concerned, classes, structures, and arrays are simply memory layouts, and the layout in this case is identical. In non-release builds, if you are using inline methods, they may be compiled into actual functions (primarily to help a debugger step into the methods), so that may have a small impact.
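A minimal sketch of how one could check that claim about identical layout, assuming the s_vector3 and newVector3 definitions from the question are in scope:
#include <cstddef>      // offsetof
#include <type_traits>  // std::is_standard_layout

// Sketch only: verifies at compile time that the struct has the same size and
// field offsets as the plain float[3], i.e. the same memory layout.
static_assert(std::is_standard_layout<s_vector3>::value, "plain data, no surprises");
static_assert(sizeof(s_vector3) == sizeof(newVector3), "3 floats either way");
static_assert(offsetof(s_vector3, x) == 0 * sizeof(float), "x first");
static_assert(offsetof(s_vector3, y) == 1 * sizeof(float), "y second");
static_assert(offsetof(s_vector3, z) == 2 * sizeof(float), "z third");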
Addition isn't really a good way to test the performance of a Vec3 type; a dot and/or cross product is usually a better test.
If you really care about performance, you'd basically want to take a structure-of-arrays approach (instead of the array-of-structures approach you have above). This tends to allow the compiler to apply auto-vectorisation.
i.e. instead of this:
constexpr int N = 100000;

struct Vec3 {
    float x, y, z;
};

inline float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

void dotLots(float* dps, const Vec3 a[N], const Vec3 b[N])
{
    for(int i = 0; i < N; ++i)
        dps[i] = dot(a[i], b[i]);
}
You'd do this:
constexpr int N = 100000;

struct Vec3SOA {
    float x[N], y[N], z[N];
};

void dotLotsSOA(float* dps, const Vec3SOA& a, const Vec3SOA& b)
{
    for(int i = 0; i < N; ++i)
    {
        dps[i] = a.x[i]*b.x[i] + a.y[i]*b.y[i] + a.z[i]*b.z[i];
    }
}
If you compile with -mavx2 and -mfma, then the latter version would optimise quite nicely.
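For example (assuming g++ and a placeholder file name dotlots.cpp; the vectorisation-report flag is, as far as I know, a GCC option):
# build with AVX2/FMA enabled and ask GCC to report which loops it vectorised
g++ -std=c++11 -O3 -mavx2 -mfma -fopt-info-vec -c dotlots.cpp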