Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are templates + functors/lambdas suboptimal in terms of memory usage?

For illustrative purposes, let's say I want to implement a generic integer comparing function. I can think of a few approaches for the definition/invocation of the function.

(A) Function Template + functors

template <class Compare> void compare_int (int a, int b, const std::string& msg, Compare cmp_func) 
{
    if (cmp_func(a, b)) std::cout << "a is " << msg << " b" << std::endl;
    else std::cout << "a is not " << msg << " b" << std::endl;
}

struct MyFunctor_LT {
    bool operator() (int a, int b) {
        return a<b;
    }
};

And this would be a couple of calls to this function:

MyFunctor_LT mflt;
MyFunctor_GT mfgt; //not necessary to show the implementation
compare_int (3, 5, "less than", mflt);
compare_int (3, 5, "greater than", mflt);

(B) Function template + lambdas

We would call compare_int like this:

compare_int (3, 5, "less than", [](int a, int b) {return a<b;});
compare_int (3, 5, "greater than", [](int a, int b) {return a>b;});

(C) Function template + std::function

Same template implementation, invocation:

std::function<bool(int,int)> func_lt = [](int a, int b) {return a<b;}; //or a functor/function
std::function<bool(int,int)> func_gt = [](int a, int b) {return a>b;}; 

compare_int (3, 5, "less than", func_lt);
compare_int (3, 5, "greater than", func_gt);

(D) Raw "C-style" pointers

Implementation:

void compare_int (int a, int b, const std::string& msg, bool (*cmp_func) (int a, int b)) 
{
 ...
}

bool lt_func (int a, int b) 
{
    return a<b;
}

Invocation:

compare_int (10, 5, "less than", lt_func); 
compare_int (10, 5, "greater than", gt_func);

With those scenarios laid out, we have in each case:

(A) Two template instances (two different parameters) will be compiled and allocated in memory.

(B) I would say also two template instances would be compiled. Each lambda is a different class. Correct me if I'm wrong, please.

(C) Only one template instance would be compiled, since template parameter is always te same: std::function<bool(int,int)>.

(D) Obviously we only have one instance.

Needless to day, it doesn't make a difference for such a naive example. But when working with dozens (or hundreds) of templates, and numerous functors, the compilation times and memory usage difference can be substantial.

Can we say that in many circumstances (i.e., when using too many functors with the same signature) std::function (or even function pointers) must be preferred over templates+raw functors/lambdas? Wrapping your functor or lambda with std::function may be very convenient.

I am aware that std::function (function pointer too) introduces an overhead. Is it worth it?

EDIT. I did a very simple benchmark using the following macros and a very common standard library function template (std::sort):

#define TEST(X) std::function<bool(int,int)>  f##X = [] (int a, int b) {return (a^X)<(b+X);}; \
std::sort (v.begin(), v.end(), f##X);

#define TEST2(X) auto f##X = [] (int a, int b) {return (a^X)<(b^X);}; \
std::sort (v.begin(), v.end(), f##X);

#define TEST3(X) bool(*f##X)(int, int) = [] (int a, int b) {return (a^X)<(b^X);}; \ 
std::sort (v.begin(), v.end(), f##X);

Results are the following regarding size of the generated binaries (GCC at -O3):

  • Binary with 1 TEST macro instance: 17009
  • 1 TEST2 macro instance: 9932
  • 1 TEST3 macro instance: 9820
  • 50 TEST macro instances: 59918
  • 50 TEST2 macro instances: 94682
  • 50 TEST3 macro instances: 16857

Even if I showed the numbers, it is a more qualitative than quantitative benchmark. As we were expecting, function templates based on the std::function parameter or function pointer scale better (in terms of size), as not so many instances are created. I did not measure runtime memory usage though.

As for the performance results (vector size is 1000000 of elements):

  • 50 TEST macro instances: 5.75s
  • 50 TEST2 macro instances: 1.54s
  • 50 TEST3 macro instances: 3.20s

It is a notable difference, and we must not neglect the overhead introduced by std::function (at least if our algorithms consist of millions of iterations).

like image 977
jbgs Avatar asked Jan 14 '14 22:01

jbgs


4 Answers

As others have already pointed out, lambdas and function objects are likely to be inlined, especially if the body of the function is not too long. As a consequence, they are likely to be better in terms of speed and memory usage than the std::function approach. The compiler can optimize your code more aggressively if the function can be inlined. Shockingly better. std::function would be my last resort for this reason, among other things.

But when working with dozens (or hundreds) of templates, and numerous functors, the compilation times and memory usage difference can be substantial.

As for the compilation times, I wouldn't worry too much about it as long as you are using simple templates like the one shown. (If you are doing template metaprogramming, yeah, then you can start worrying.)

Now, the memory usage: By the compiler during compilation or by the generated executable at run time? For the former, same holds as for the compilation time. For the latter: Inlined lamdas and function objects are the winners.

Can we say that in many circumstances std::function (or even function pointers) must be preferred over templates+raw functors/lambdas? I.e. wrapping your functor or lambda with std::function may be very convenient.

I am not quite sure how to answer to that one. I cannot define "many circumstances".

However, one thing I can say for sure is that type erasure is a way to avoid / reduce code bloat due to templates, see Item 44: Factor parameter-independent code out of templates in Effective C++. By the way, std::function uses type erasure internally. So yes, code bloat is an issue.

I am aware that std::function (function pointer too) introduces an overhead. Is it worth it?

"Want speed? Measure." (Howard Hinnant)

One more thing: function calls through function pointers can be inlined (even across compilation units!). Here is a proof:

#include <cstdio>

bool lt_func(int a, int b) 
{
    return a<b;
}

void compare_int(int a, int b, const char* msg, bool (*cmp_func) (int a, int b)) {
    if (cmp_func(a, b)) printf("a is %s b\n", msg);
    else printf("a is not %s b\n", msg);
}

void f() {
  compare_int (10, 5, "less than", lt_func); 
}

This is a slightly modified version of your code. I removed all the iostream stuff because it makes the generated assembly cluttered. Here is the assembly of f():

.LC1:
    .string "a is not %s b\n"
[...]
.LC2:
    .string "less than"
[...]
f():
.LFB33:
    .cfi_startproc
    movl    $.LC2, %edx
    movl    $.LC1, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    jmp __printf_chk
    .cfi_endproc

Which means, that gcc 4.7.2 inlined lt_func at -O3. In fact, the generated assembly code is optimal.

I have also checked: I moved the implementation of lt_func into a separate source file and enabled link time optimization (-flto). GCC still happily inlined the call through the function pointer! It is nontrivial and you need a quality compiler to do that.


Just for the record, and that you can actually feel the overhead of the std::function approach:

This code:

#include <cstdio>
#include <functional>

template <class Compare> void compare_int(int a, int b, const char* msg, Compare cmp_func) 
{
    if (cmp_func(a, b)) printf("a is %s b\n", msg);
    else printf("a is not %s b\n", msg);
}

void f() {
  std::function<bool(int,int)> func_lt = [](int a, int b) {return a<b;};
  compare_int (10, 5, "less than", func_lt); 
}

yields this assembly at -O3 (approx. 140 lines):

f():
.LFB498:
    .cfi_startproc
    .cfi_personality 0x3,__gxx_personality_v0
    .cfi_lsda 0x3,.LLSDA498
    pushq   %rbx
    .cfi_def_cfa_offset 16
    .cfi_offset 3, -16
    movl    $1, %edi
    subq    $80, %rsp
    .cfi_def_cfa_offset 96
    movq    %fs:40, %rax
    movq    %rax, 72(%rsp)
    xorl    %eax, %eax
    movq    std::_Function_handler<bool (int, int), f()::{lambda(int, int)#1}>::_M_invoke(std::_Any_data const&, int, int), 24(%rsp)
    movq    std::_Function_base::_Base_manager<f()::{lambda(int, int)#1}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<f()::{lambda(int, int)#1}> const&, std::_Manager_operation), 16(%rsp)
.LEHB0:
    call    operator new(unsigned long)
.LEHE0:
    movq    %rax, (%rsp)
    movq    16(%rsp), %rax
    movq    $0, 48(%rsp)
    testq   %rax, %rax
    je  .L14
    movq    24(%rsp), %rdx
    movq    %rax, 48(%rsp)
    movq    %rsp, %rsi
    leaq    32(%rsp), %rdi
    movq    %rdx, 56(%rsp)
    movl    $2, %edx
.LEHB1:
    call    *%rax
.LEHE1:
    cmpq    $0, 48(%rsp)
    je  .L14
    movl    $5, %edx
    movl    $10, %esi
    leaq    32(%rsp), %rdi
.LEHB2:
    call    *56(%rsp)
    testb   %al, %al
    movl    $.LC0, %edx
    jne .L49
    movl    $.LC2, %esi
    movl    $1, %edi
    xorl    %eax, %eax
    call    __printf_chk
.LEHE2:
.L24:
    movq    48(%rsp), %rax
    testq   %rax, %rax
    je  .L23
    leaq    32(%rsp), %rsi
    movl    $3, %edx
    movq    %rsi, %rdi
.LEHB3:
    call    *%rax
.LEHE3:
.L23:
    movq    16(%rsp), %rax
    testq   %rax, %rax
    je  .L12
    movl    $3, %edx
    movq    %rsp, %rsi
    movq    %rsp, %rdi
.LEHB4:
    call    *%rax
.LEHE4:
.L12:
    movq    72(%rsp), %rax
    xorq    %fs:40, %rax
    jne .L50
    addq    $80, %rsp
    .cfi_remember_state
    .cfi_def_cfa_offset 16
    popq    %rbx
    .cfi_def_cfa_offset 8
    ret
    .p2align 4,,10
    .p2align 3
.L49:
    .cfi_restore_state
    movl    $.LC1, %esi
    movl    $1, %edi
    xorl    %eax, %eax
.LEHB5:
    call    __printf_chk
    jmp .L24
.L14:
    call    std::__throw_bad_function_call()
.LEHE5:
.L32:
    movq    48(%rsp), %rcx
    movq    %rax, %rbx
    testq   %rcx, %rcx
    je  .L20
    leaq    32(%rsp), %rsi
    movl    $3, %edx
    movq    %rsi, %rdi
    call    *%rcx
.L20:
    movq    16(%rsp), %rax
    testq   %rax, %rax
    je  .L29
    movl    $3, %edx
    movq    %rsp, %rsi
    movq    %rsp, %rdi
    call    *%rax
.L29:
    movq    %rbx, %rdi
.LEHB6:
    call    _Unwind_Resume
.LEHE6:
.L50:
    call    __stack_chk_fail
.L34:
    movq    48(%rsp), %rcx
    movq    %rax, %rbx
    testq   %rcx, %rcx
    je  .L20
    leaq    32(%rsp), %rsi
    movl    $3, %edx
    movq    %rsi, %rdi
    call    *%rcx
    jmp .L20
.L31:
    movq    %rax, %rbx
    jmp .L20
.L33:
    movq    16(%rsp), %rcx
    movq    %rax, %rbx
    testq   %rcx, %rcx
    je  .L29
    movl    $3, %edx
    movq    %rsp, %rsi
    movq    %rsp, %rdi
    call    *%rcx
    jmp .L29
    .cfi_endproc

Which approach would you like to choose when it comes to performance?

like image 77
Ali Avatar answered Nov 15 '22 20:11

Ali


If you bind your lambda to a std::function then your code will run slower because it will no longer be inlineable, invoking goes through function pointer and creation of the function object possibly requires a heap allocation if the size of the lambda (= size of captured state) exceeds the small buffer limit (which is equal to the size of one or two pointers on GCC IIRC).

If you keep your lambda, e.g. auto a = []{};, then it will be as fast as an inlineable function (perhaps even faster because there's no conversion to function pointer when passed as an argument to a function.)

The object code generated by lambda and inlineable function objects will be zero when compiling with optimizations enabled (-O1 or higher in my tests). Occasionally the compiler may refuse to inline but that normally only happens when trying to inline big function bodies.

You can always look at the generated assembly if you want to be certain.

like image 44
StackedCrooked Avatar answered Nov 15 '22 18:11

StackedCrooked


I will discuss what happens naively, and common optimizations.

(A) Function Template + functors

In this case, there will be one functor whose type and arguments fully describes what happens when you invoke (). There will be one instance of the template function per functor passed to it.

While the functor has a minimium size of 1 byte, and must technically be copied, the copy is a no-op (not even the 1 byte of space need be copied: a dumb compiler/low optimization settings may result in it being copied anyhow).

Optimizing the existance of that functor object away is an easy thing for compilers to do: the method is inline and has the same symbol name, so this will even happen over multiple compilation units. Inlining the call is also very easy.

If you had multiple function objects with the same implementation, havine one instance of the functor despite this takes a bit of effort, but some compilers can do it. Inlining the template functions may also be easy. In your toy example, as the inputs are known at compile time, branches can be evaluated at compile time, dead code eliminated, and everything reduced to a single std::cout call.

(B) Function template + lambdas

In this case, each lambda is its own type, and each closure instance is an undefined size instance of that lambda (usually 1 byte as it captures nothing). If an identical lambda is defined and used at different spots, these are different types. Each call location for the function object is a distinct instantiation of the template function.

Removing the existance of the 1 byte closures, assuming they are stateless, is easy. Inlining them is also easy. Removing duplicate instances of the template function with the same implementation but different signature is harder, but some compilers will do it. Inlining said functions is no harder than above. In your toy example, as the inputs are known at compile time, branches can be evaluated at compile time, dead code eliminated, and everything reduced to a single std::cout call.

(C) Function template + std::function

std::function is a type erasure object. An instance of std::function with a given signature has the same type as another. However, the constructor for std::function is templated on the type passed in. In your case, you are passing in a lambda -- so each location where you initialize a std::function with a lambda generates a distinct std::function constructor, which does unknown code.

A typical implementation of std::function will use the pImpl pattern to store a pointer to the abstract interface to a helper object that wraps your callable, and knows how to copy/move/call it with the std::function signature. One such callable type is created per type std::function is constructed from per std::function signature it is constructed to.

One instance of the function will be created, taking a std::function.

Some compilers can notice duplicate methods and use the same implementation for both, and maybe pull of a similar trick for (much) of their virtual function table (but not all, as dynamic casting requires they be different). This is less likely to happen than the earlier duplicate function elimination. The code in the duplicated methods used by the std::function helper is probably simpler than other duplicated functions, so that could be cheaper.

While the template function can be inlined, I am not aware of a C++ compiler that can optimize the existance of a std::function away, as they are usually implemented as library solutions consisting of relatively complex and opaque code to the compiler. So while theoretically it could be evaluated as all information is there, in practice the std::function will not be inlined into the template function, and no dead code elimination will occur. Both branches will make it into the resulting binary, together with a pile of std::function boilerplate for itself and its helper.

Calling a std::function is roughly as expensive as making a virtual method call -- or, as a rule of thumb, very roughly as expensive as two function pointer calls.

(D) Raw "C-style" pointers

A function is created, its address taken, this address is passed to compare_int. It then dereferences that pointer to find the actual function, and calls it.

Some compilers are good at noticing that the function pointer is created from a literal, then inlining the call here: not all compilers can do this, and no or few compilers can do this in the general case. If they cannot (because the initialization isn't from a literal, because the interface is in one compilation unit while the implementation is in another) there is a significant cost to following a function pointer to data -- the computer tends to be unable to cache the location it is going to, so there is a pipeline stall.

Note that you can call the raw "C-style" pointers with stateless lambdas, as stateless lambdas implicitly convert to function pointers. Also note that this example is strictly weaker than the other ones: it does not accept stateful functions. The version with the same power would be a C-style function that takes both the pair of ints and a void* state.

like image 36
Yakk - Adam Nevraumont Avatar answered Nov 15 '22 19:11

Yakk - Adam Nevraumont


You should use auto keyword with lambda functions, not std::function. That way you get unique type and no runtime overhead of std::function.

Also, as dyp suggests, stateless (that is, no captures) lamba functions can be converted to function pointer.

like image 27
dsi Avatar answered Nov 15 '22 19:11

dsi