Here's one thing I haven't seen explicitly addressed in C++ expression template programming, where unnecessary temporaries are avoided by building trees of "inlinable templated objects" that only get collapsed at the assignment operator. For illustration, suppose we're modelling 1-D sequences of values, with elementwise application of arithmetic operators like +, *, etc. Call the basic class for fully-created sequences Seq (which, for the sake of concreteness, holds a fixed-length list of doubles) and consider the following illustrative pseudo-C++ code.
void f(Seq &a,Seq &b,Seq &c,Seq &d,Seq &e){
    AType t=(a+2*b)/(a+b+c); // question is about what AType can be
    Seq f=d*t;
    Seq g=e*e*t;
    //do something with f and g
}
where expression-template overloads for +, etc. are defined elsewhere. For the line defining t, I see two options:
I can implement this code if I make AType be Seq, but then I've created this full intermediate variable when I don't need it (except in how it enables computation of f and g). But at least it's only calculated once.
I can also implement this by making AType the appropriate templated expression type, so that a full Seq isn't created at the commented line; instead t is consumed chunk-by-chunk in f and g. But then the computation involved in creating each chunk is repeated in both f and g. (I suppose in theory an incredibly smart compiler might realise the same computation is being done twice and apply common subexpression elimination, but I don't think any do, and I wouldn't want to rely on an optimiser always being able to spot the opportunity.)
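The trade-off between the two options can be made concrete with a small counting experiment. The following is only a minimal sketch, not the real Seq from the question: the names Expr, Bin, Held, div_count and divisions are all hypothetical, scalar operands are left out (so a+2*b is written as a+b+b), and leaves are held by reference while inner nodes are held by value so that storing the expression in a variable does not dangle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

static long div_count = 0;  // counts element-wise evaluations of the division node

// CRTP base (hypothetical name) so operators only match expression types.
template <class E> struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Hypothetical stand-in for the question's Seq.
struct Seq : Expr<Seq> {
    std::vector<double> v;
    explicit Seq(std::size_t n, double x = 0.0) : v(n, x) {}
    // Constructing from an expression collapses the tree element by element.
    template <class E>
    Seq(const Expr<E>& e) : v(e.self().size()) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = e.self()[i];
    }
    double operator[](std::size_t i) const { return v[i]; }
    std::size_t size() const { return v.size(); }
};

// Inner nodes are stored by value, Seq leaves by reference.
template <class E> struct Held { using type = E; };
template <> struct Held<Seq> { using type = const Seq&; };

template <class L, class R, class Op>
struct Bin : Expr<Bin<L, R, Op>> {
    typename Held<L>::type l;
    typename Held<R>::type r;
    Bin(const L& l_, const R& r_) : l(l_), r(r_) { assert(l.size() == r.size()); }
    double operator[](std::size_t i) const { return Op::apply(l[i], r[i]); }
    std::size_t size() const { return l.size(); }
};

struct Add { static double apply(double x, double y) { return x + y; } };
struct Mul { static double apply(double x, double y) { return x * y; } };
struct Div { static double apply(double x, double y) { ++div_count; return x / y; } };

template <class L, class R>
Bin<L, R, Add> operator+(const Expr<L>& l, const Expr<R>& r) { return {l.self(), r.self()}; }
template <class L, class R>
Bin<L, R, Mul> operator*(const Expr<L>& l, const Expr<R>& r) { return {l.self(), r.self()}; }
template <class L, class R>
Bin<L, R, Div> operator/(const Expr<L>& l, const Expr<R>& r) { return {l.self(), r.self()}; }

// Returns how many element-wise divisions were performed to build f and g.
long divisions(std::size_t n, bool materialize) {
    div_count = 0;
    Seq a(n, 1.0), b(n, 2.0), c(n, 3.0), d(n, 4.0), e(n, 5.0);
    if (materialize) {
        Seq t = (a + b + b) / (a + b + c);   // AType = Seq: full intermediate, computed once
        Seq f = d * t;
        Seq g = e * e * t;
        (void)f; (void)g;
    } else {
        auto t = (a + b + b) / (a + b + c);  // AType = expression type: nothing computed yet
        Seq f = d * t;                       // evaluates every t[i]
        Seq g = e * e * t;                   // evaluates every t[i] again
        (void)f; (void)g;
    }
    return div_count;
}
```

In this sketch, divisions(n, true) performs n divisions at the cost of one intermediate Seq, while divisions(n, false) allocates no intermediate but performs 2n divisions, one set per consuming assignment.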
Is my understanding correct that there's no clever code rewriting and/or use of templates that allows each chunk of t to be calculated only once, while t is still calculated chunkwise rather than all at once?
(I can vaguely imagine AType could be some kind of object that contains both an expression template type and a cached value that gets written after it's evaluated the first time, but that doesn't seem to help with the need to synchronise the two implicit loops in the assignments to f and g.)
In googling, I have come across one Masters thesis on another subject that mentions in passing that manual "common subexpression elimination" should be avoided with expression templates, but I'd like to find a more authoritative "it's not possible" or a "here's how to do it".
The closest Stack Overflow question is "Intermediate results using expression templates", which seems to be about the type-naming issue rather than the efficiency issue in creating a full intermediate.
Since you obviously don't want to do the entire calculation twice, you have to cache it somehow. The easiest way to cache it seems to be for AType to be a Seq. You say this has the downside of creating a full intermediate variable, but that's exactly what you want in this case. That full intermediate is your cache, and cannot be trivially avoided.
If you profile the code and this turns out to be a bottleneck, then the only faster way I can think of is to write a special function that calculates f and g together in a single loop, but that'd be super-confusing, and very much not recommended.
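// Fills f and g in a single pass so that each element of t is evaluated only once.
template <class Expr>
void calc_f_and_g(const Seq &d, const Seq &e, const Expr &t, Seq &f, Seq &g)
{
    for(int i=0; i<d.size(); ++i) {
        auto ti = t[i];   // evaluate this element of t once
        f[i] = d[i]*ti;
        g[i] = e[i]*e[i]*ti;
    }
}
void f(Seq &a,Seq &b,Seq &c,Seq &d,Seq &e)
{
    auto t = (a+2*b)/(a+b+c);   // unevaluated expression template
    Seq f, g;
    calc_f_and_g(d, e, t, f, g);
    //do something with f and g
}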
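For anyone who wants to try the single-loop idea in isolation, here is a minimal self-contained sketch. It rests on several assumptions that are not in the question: Seq is just an alias for std::vector<double>, a lambda called as t(i) stands in for the expression-template type, and the names eval_f_and_g, Result and run_fused are made up for the illustration. The counter only shows that each element of t is evaluated once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Seq = std::vector<double>;  // assumed stand-in for the question's Seq

// Fused evaluation: one loop fills both outputs, so t(i) is computed once per element.
template <class T>
void eval_f_and_g(const Seq& d, const Seq& e, const T& t, Seq& f, Seq& g) {
    assert(e.size() == d.size() && f.size() == d.size() && g.size() == d.size());
    for (std::size_t i = 0; i < d.size(); ++i) {
        const double ti = t(i);   // the shared subexpression, evaluated lazily
        f[i] = d[i] * ti;
        g[i] = e[i] * e[i] * ti;
    }
}

struct Result {
    Seq f, g;
    long t_evals;  // number of times an element of t was computed
};

Result run_fused(std::size_t n) {
    Seq a(n, 1.0), b(n, 1.0), c(n, 1.0), d(n, 4.0), e(n, 5.0);
    long t_evals = 0;
    // Lambda standing in for the lazy expression (a+2*b)/(a+b+c).
    auto t = [&](std::size_t i) {
        ++t_evals;
        return (a[i] + 2 * b[i]) / (a[i] + b[i] + c[i]);
    };
    Seq f(n), g(n);
    eval_f_and_g(d, e, t, f, g);
    return {f, g, t_evals};
}
```

With these inputs every element of t is 1, so run_fused(n) gives f equal to d and g equal to e*e, with t_evals equal to n and no intermediate sequence allocated. The price is the one described above: the two assignments are no longer written as separate expressions, which is why this is hard to read and hard to generalise.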