I recently needed a lambda that captures multiple local variables by reference, so I made a test snippet to investigate how efficiently the capture is implemented, and compiled it with -O3 using clang 3.6:
void do_something_with(void*);

void test()
{
    int a = 0, b = 0, c = 0;
    auto func = [&] () {
        a++;
        b++;
        c++;
    };
    do_something_with((void*)&func);
}
movl $0x0,0x24(%rsp)
movl $0x0,0x20(%rsp)
movl $0x0,0x1c(%rsp)
lea 0x24(%rsp),%rax
mov %rax,(%rsp)
lea 0x20(%rsp),%rax
mov %rax,0x8(%rsp)
lea 0x1c(%rsp),%rax
mov %rax,0x10(%rsp)
lea (%rsp),%rdi
callq ...
Clearly the lambda only needs the address of one of the variables, from which all the others could be obtained by relative addressing.
Instead, the compiler created a struct on the stack containing pointers to each local variable, and then passed the address of that struct to the lambda. It is much the same as if I had written:
int a = 0, b = 0, c = 0;

struct X
{
    int *pa, *pb, *pc;
};

X x = {&a, &b, &c};

auto func = [p = &x] () {
    (*p->pa)++;
    (*p->pb)++;
    (*p->pc)++;
};
This is inefficient for various reasons, most worryingly because it could lead to heap allocation if too many variables are captured.
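One place that concern can show up in practice (an assumption on my part; the snippet above does not do this) is when the closure is type-erased, for example stored in a std::function: with one pointer per captured variable the closure can outgrow the small-object buffer that common implementations provide, whereas a single-frame-pointer closure would not. A minimal sketch:

#include <cstdio>
#include <functional>

void test_size()
{
    int a = 0, b = 0, c = 0, d = 0, e = 0;

    // under the pointer-per-variable scheme this closure holds five pointers
    auto big = [&] { a++; b++; c++; d++; e++; };
    std::printf("sizeof(big) = %zu\n", sizeof(big));

    // common std::function implementations keep a small-object buffer of only a
    // couple of pointers, so a closure this large is likely to be heap-allocated
    std::function<void()> erased = big;
    erased();
}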
My questions:
1. The fact that both clang and gcc do this at -O3 makes me suspect that something in the standard actually forces closures to be implemented inefficiently. Is this the case?
2. If so, what is the reasoning? It cannot be for binary compatibility of lambdas between compilers, because any code that knows about the type of the lambda is guaranteed to lie in the same translation unit.
3. If not, why is this optimisation missing from two major compilers?
EDIT:
Here is an example of the more efficient code that I would like to have seen from the compiler. This code uses less stack space, the lambda now performs only one pointer indirection instead of two, and the lambda's size does not grow with the number of captured variables:
struct X
{
    int a = 0, b = 0, c = 0;
} x;

auto func = [&x] () {
    x.a++;
    x.b++;
    x.c++;
};
movl $0x0,0x8(%rsp)
movl $0x0,0xc(%rsp)
movl $0x0,0x10(%rsp)
lea 0x8(%rsp),%rax
mov %rax,(%rsp)
lea (%rsp),%rdi
callq ...
It looks like this is unspecified. The following paragraph from the C++14 draft standard N3936, section 5.1.2 Lambda expressions [expr.prim.lambda], makes me think this:
An entity is captured by reference if it is implicitly or explicitly captured but not captured by copy. It is unspecified whether additional unnamed non-static data members are declared in the closure type for entities captured by reference. [...]
which differs from the wording for entities captured by copy:
Every id-expression within the compound-statement of a lambda-expression that is an odr-use (3.2) of an entity captured by copy is transformed into an access to the corresponding unnamed data member of the closure type.
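The difference is visible in sizeof. A small sketch of my own, assuming a typical 64-bit target where each by-reference capture currently happens to be stored as a pointer:

#include <cstdio>

int main()
{
    int a = 0, b = 0, c = 0;

    // by-copy captures must become data members, so three ints are stored here
    auto by_copy = [=] { return a + b + c; };

    // by-reference captures need not be represented one-per-entity, although
    // current implementations store one pointer each (as the question observed)
    auto by_ref = [&] { return a + b + c; };

    std::printf("%zu %zu\n", sizeof(by_copy), sizeof(by_ref));
    return by_copy() + by_ref();
}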
Thanks to dyp for pointing out some relevant documents which I somehow missed. It looks like defect report 750: Implementation constraints on reference-only closure objects provides the rationale for the current wording, and it says:
Consider an example like:
void f(vector<double> vec) {
    double x, y, z;
    fancy_algorithm(vec, [&]() { /* use x, y, and z in various ways */ });
}
5.1.2 [expr.prim.lambda] paragraph 8 requires that the closure class for this lambda will have three reference members, and paragraph 12 requires that it be derived from std::reference_closure, implying two additional pointer members. Although 8.3.2 [dcl.ref] paragraph 4 allows a reference to be implemented without allocation of storage, current ABIs require that references be implemented as pointers. The practical effect of these requirements is that the closure object for this lambda expression will contain five pointers. If not for these requirements, however, it would be possible to implement the closure object as a single pointer to the stack frame, generating data accesses in the function-call operator as offsets relative to the frame pointer. The current specification is too tightly constrained.
which echoes your exact points about allowing potential optimizations. The resolution was implemented as part of N2927, which includes the following:
The new wording no longer specifies any rewrite or closure members for "by reference" capture. Uses of entities captured "by reference" affect the original entities, and the mechanism to achieve this is left entirely to the implementation.
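To make the permitted scheme concrete, here is a rough, hand-written sketch of the single-pointer layout DR 750 describes; the Frame and Closure names are purely illustrative and not anything a compiler actually emits:

struct Frame { double x, y, z; };   // stand-in for part of the enclosing stack frame

struct Closure
{
    Frame* frame;                   // one pointer, however many variables are used

    void operator()() const
    {
        // each access is just a fixed offset from the frame pointer
        frame->x += 1.0;
        frame->y += 1.0;
        frame->z += 1.0;
    }
};

With this layout the closure stays pointer-sized regardless of how many variables the lambda touches, which is exactly the saving the question asks about.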