How to conditionally set compiler optimization for template headers

Tags:

c++

compiler-optimization

templates

visual-c++

I found a question somewhat interesting and tried to answer it. The author wants to compile one source file (which relies on template libraries) with AVX optimizations, and the rest of the project without them.

So, to see what would happen, I've created a test project like this:

main.cpp

#include <iostream>
#include <string>
#include "fn_normal.h"
#include "fn_avx.h"

int main(int argc, char* argv[])
{   
    int number = 10; // this will come from input, but let's keep it simple for now
    int result;

    if (std::string(argv[argc - 1]) == "--noavx")
        result = FnNormal(number);
    else
    {
        std::cout << "AVX selected\n";
        result = FnAVX(number);
    }

    std::cout << "Double of " << number << " is " << result << std::endl;

    return 0;
}

Files fn_normal.h and fn_avx.h contain the declarations of FnNormal() and FnAVX() respectively, which are defined as follows:

fn_normal.cpp

#include "fn_normal.h"
#include "double.h"

int FnNormal(int num)
{
    return RtDouble(num);
}

fn_avx.cpp

#include "fn_avx.h"
#include "double.h"

int FnAVX(int num)
{
    return RtDouble(num);
}

And here's the template function definition:

double.h

template<typename T>
int RtDouble(T number)
{
    // Side effect: generates avx instructions
    const int N = 1000;
    float a[N], b[N];
    for (int n = 0; n < N; ++n)
    {
        a[n] = b[n] * b[n] * b[n];
    }    
    return number * 2;
}

Finally, I set Enhanced Instruction Set to AVX for the file fn_avx.cpp under "Properties -> C/C++ -> Code Generation", leaving it at Not Set for the other sources, so they should default to SSE2.

I thought that by doing so, the compiler would instantiate the template once for each source file that includes it (and avoid violating the One-Definition Rule by mangling the template function name or in some other way), so that calling the program with the --noavx parameter would make it run fine on CPUs without AVX support.
But the resulting program actually contains only one machine-code version of the function, with AVX instructions, and fails on older CPUs.

Disabling all other optimizations doesn't solve this issue. I also tried No Enhanced Instructions (/arch:IA32) instead of Not Set.

As I'm only beginning to understand templates, could someone explain exactly why this happens and what I could do to achieve my goal?

My compiler is MSVC 2013.

Additional info: the .obj files for fn_normal.cpp and fn_avx.cpp are almost the same size in bytes. I've looked at the generated assembly listings and they are almost identical, with the important difference that the AVX-enabled source replaces the default SSE movss/mulss with vmovss and vmulss, respectively. But stepping through the code in Visual Studio's disassembly view (Ctrl+Alt+D) confirms that FnNormal() does in fact use the AVX instructions.

697

asked Mar 31 '15 23:03

Marc.2377

4 Answers

The compiler will generate two objects (fn_avx.obj and fn_normal.obj), which are compiled with different instruction sets. As you said, outputting the disassembly for both verifies that this is being done correctly:

objdump -d fn_normal.obj:

...
movss  -0x1f5c(%ebp,%eax,4),%xmm0
mulss  -0x1f5c(%ebp,%ecx,4),%xmm0
mov    -0x1f68(%ebp),%edx
mulss  -0x1f5c(%ebp,%edx,4),%xmm0
mov    -0x1f68(%ebp),%eax
movss  %xmm0,-0xfb4(%ebp,%eax,4)
...

objdump -d fn_avx.obj:

...
vmovss -0x1f5c(%ebp,%eax,4),%xmm0
vmulss -0x1f5c(%ebp,%ecx,4),%xmm0,%xmm0
mov    -0x1f68(%ebp),%edx
vmulss -0x1f5c(%ebp,%edx,4),%xmm0,%xmm0
mov    -0x1f68(%ebp),%eax
vmovss %xmm0,-0xfb4(%ebp,%eax,4)
...

They look strikingly similar because MSVC 2013 assumes SSE2 availability by default. If you change the instruction set to IA32, you'll get non-vector instructions instead. So this is not an issue with the compiler or the compilation unit.

The issue here is that RtDouble is defined in a header file as a non-specialized template (perfectly legal). The compiler assumes its definition will be the same across all translation units, but compiling them with different options violates that assumption. It's essentially no different from introducing a divergence with the preprocessor:

double.h:

template<typename T>
int RtDouble(T number)
{
#ifdef SUPER_BAD
    // Side effect: generates avx instructions
    const int N = 1000;
    float a[N], b[N];
    for (int n = 0; n < N; ++n)
    {
        a[n] = b[n] * b[n] * b[n];
    }
    return number * 2;
#else
    return 0;
#endif
}

fn_avx.cpp:

#include "fn_avx.h"
#define SUPER_BAD
#include "double.h"

int FnAVX(int num)
{
    return RtDouble(num);
}

FnNormal will then just return 0 (you can verify this with the disassembly of the new fn_normal.obj). The linker happily chooses one definition and does not warn you in either situation. The question then comes down to: should it? A warning would be extremely helpful in situations like this. However, it would also slow down linking, since the linker would need to compare every function that can exist in multiple compilation units (e.g. inline functions as well).

When I faced a similar issue in my own code, I chose a different function naming scheme for the optimized version vs. the non-optimized version. Using a template parameter to distinguish them would work just as well (as suggested in @celtschk's answer).
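As a minimal sketch of what such a naming scheme could look like (the macro DEFINE_RT_DOUBLE and the names RtDoubleNormal and RtDoubleAvx are hypothetical, chosen only for illustration), the function body can be written once and stamped out under a different name in each source file. Here both uses are shown in one file for brevity; in the real project each source file would contain only its own line.

```cpp
// double_named.h (hypothetical): the body is written once,
// but every use picks its own function name.
#define DEFINE_RT_DOUBLE(NAME)      \
    template<typename T>            \
    int NAME(T number)              \
    {                               \
        return number * 2;          \
    }

// fn_normal.cpp (default instruction set) would contain only this line:
DEFINE_RT_DOUBLE(RtDoubleNormal)

// fn_avx.cpp (compiled with /arch:AVX) would contain only this line:
DEFINE_RT_DOUBLE(RtDoubleAvx)

int FnNormal(int num) { return RtDoubleNormal(num); }
int FnAVX(int num)    { return RtDoubleAvx(num); }
```

Because RtDoubleNormal<int> and RtDoubleAvx<int> are distinct symbols, the linker has no duplicate definitions to merge, so each translation unit keeps the code generated with its own options.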

157

answered Oct 07 '22 06:10

MuertoExcobito

Basically, the compiler needs to minimize code size, not to mention that having the same template instantiated twice could cause problems if there were static members. As far as I know, the compiler either processes the template for every source file and then picks one of the implementations, or it postpones the actual code generation to link time. Either way, it is a problem for this AVX use case. I ended up solving it the old-fashioned way, with some global definitions that do not depend on any templates. For very complex applications this could be a huge problem, though. The Intel compiler has a recently added pragma (I don't recall the exact name) that makes the function implemented right after it use just AVX instructions, which would solve the problem. How reliable it is, I don't know.

answered Oct 07 '22 06:10

mrzacek

I've worked around this problem successfully by forcing any templated functions that will be used with different compiler options in different source files to be inline. Just using the inline keyword is usually not sufficient, since the compiler will sometimes ignore it for functions larger than some threshold, so you have to force the compiler to do it.

In MSVC++:

template<typename T>
__forceinline int RtDouble(T number) {...}

GCC:

template<typename T>
inline __attribute__((always_inline)) int RtDouble(T number) {...}

Keep in mind you may have to forceinline any other functions that RtDouble may call within the same module in order to keep the compiler flags consistent in those functions as well. Also keep in mind that MSVC++ simply ignores __forceinline when optimizations are disabled, such as in debug builds, and in those cases this trick won't work, so expect different behavior in non-optimized builds. It can make things problematic to debug in any case, but it does indeed work so long as the compiler allows inlining.
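A rough sketch of how the two spellings could be wrapped in a single macro so the same header works with both compilers (RT_FORCE_INLINE is a hypothetical name, and the plain-inline fallback for other compilers is an assumption that does not guarantee inlining):

```cpp
// Pick the compiler-specific "always inline" spelling.
#if defined(_MSC_VER)
#  define RT_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define RT_FORCE_INLINE inline __attribute__((always_inline))
#else
#  define RT_FORCE_INLINE inline  // fallback: only a hint, may be ignored
#endif

template<typename T>
RT_FORCE_INLINE int RtDouble(T number)
{
    return number * 2;
}

// When inlining happens, the body of RtDouble is compiled as part of the
// caller, using the options of the source file that contains the caller.
int FnNormal(int num)
{
    return RtDouble(num);
}
```

The same caveat applies here: in builds where the compiler declines to inline, an out-of-line copy of RtDouble is emitted again and the linker may still keep only one of them.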

answered Oct 07 '22 07:10

Kumputer

I think the simplest solution is to let the compiler know that those functions are indeed intended to be different, by using a template parameter that does nothing but distinguish them:

File double.h:

template<bool avx, typename T>
int RtDouble(T number)
{
    // Side effect: generates avx instructions
    const int N = 1000;
    float a[N], b[N];
    for (int n = 0; n < N; ++n)
    {
        a[n] = b[n] * b[n] * b[n];
    }    
    return number * 2;
}

File fn_normal.cpp:

#include "fn_normal.h"
#include "double.h"

int FnNormal(int num)
{
    return RtDouble<false>(num);
}

File fn_avx.cpp:

#include "fn_avx.h"
#include "double.h"

int FnAVX(int num)
{
    return RtDouble<true>(num);
}
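As a possible refinement, sketched under the assumption that the compiler predefines `__AVX__` when AVX code generation is enabled (MSVC documents this for /arch:AVX, and GCC and Clang do the same for -mavx), the tag could be derived from the build options instead of being typed by hand. RT_AVX_BUILD and FnThisFile are hypothetical names used only for this illustration.

```cpp
// Derive the distinguishing tag from the options this file is compiled with.
#if defined(__AVX__)
#  define RT_AVX_BUILD true
#else
#  define RT_AVX_BUILD false
#endif

template<bool avx, typename T>
int RtDouble(T number)
{
    return number * 2;
}

// Written in a .cpp file, never in a shared inline function or header
// template, so that the macro is expanded once per translation unit.
int FnThisFile(int num)
{
    return RtDouble<RT_AVX_BUILD>(num);
}
```

A source file built with AVX enabled then instantiates RtDouble<true, int>, and every other file instantiates RtDouble<false, int>, so a forgotten or mistyped tag cannot silently bring the original problem back.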

answered Oct 07 '22 07:10

celtschk
