Sometimes, one wants to write a (small) CUDA device-side function which returns two values. In C-style code, you would have that function take two out-parameters, e.g.:
__device__ void pair_maker(float x, float& out1, float& out2);
but in C++, the idiomatic way to write this is to return an std::pair (well, maybe an std::tuple, or a struct, but C++ tuples are clunky and a struct is not generic enough):
__device__ std::pair<float, float> pair_maker(float x);
My question: Can I trust NVCC (with --expt-relaxed-constexpr) to optimize away the construction of the pair, and just assign directly to the variables which I later assign to from the .first and .second elements of the pair?
I don't have a complete answer, but from my limited experience, it seems that NVCC can optimize the std::pair away. Illustration (also on GodBolt):
#include <utility>

// Pair-returning version: wraps the two results of __sincosf() in an std::pair.
__device__ std::pair<float, float> pair_maker(float x) {
    float sin, cos;
    __sincosf(x, &sin, &cos);
    return {sin, cos};
}

__device__ float foo(float x) {
    auto p = pair_maker(x);
    auto sin = p.first;
    auto cos = p.second;
    return sin + cos;
}

__global__ void bar(float x, float *out) { *out = foo(x); }

// Hand-written out-parameter version, for comparison against bar().
__global__ void baz(float x, float *out) {
    float sin, cos;
    __sincosf(x, &sin, &cos);
    *out = sin + cos;
}
The kernels bar() and baz() compile to the same PTX code:
ld.param.f32 %f1, [param_0];
ld.param.u64 %rd1, [param_1];
cvta.to.global.u64 %rd2, %rd1;
sin.approx.f32 %f2, %f1;
cos.approx.f32 %f3, %f1;
add.f32 %f4, %f2, %f3;
st.global.f32 [%rd2], %f4;
ret;
No extra copies or construction-related operations.
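For what it's worth, one would expect the same to hold if the call site unpacks the pair with C++17 structured bindings instead of .first/.second. A minimal sketch (assuming the file is compiled as C++17 or later, which NVCC supports; foo_sb is just an illustrative name, not from the question):

__device__ float foo_sb(float x) {
    // Bind the pair's elements directly to local names,
    // rather than reading them out of .first and .second.
    auto [sin, cos] = pair_maker(x);
    return sin + cos;
}

The binding names merely replace the explicit .first/.second reads, so the generated code should be just as amenable to the same optimization.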