Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

optimizing of std::visit possible?

While using std::visit / std::variant I see in the profiler output that std::__detail::__variant::__gen_vtable_impl functions take the most time.

I did a test like this:

// 3 class families, all like this
class ElementDerivedN: public ElementBase
{
    ...
        std::variant<ElementDerived1*, ElementDerived2*,... > GetVariant() override { return this; }
}

std::vector<Element*> elements;
std::vector<Visitor*> visitors;
std::vector<Third*>   thirds;

// prepare a hack to get dynamic function object:
template<class... Ts> struct funcs : Ts... { using Ts::operator()...; };
template<class... Ts> funcs(Ts...) -> funcs<Ts...>;

// demo functions:
struct Actions { template < typename R, typename S, typename T> void operator()( R*, S*, T* ) {} };
struct SpecialActionForElement1{ template < typename S, typename T > void operator()( Element1*, S*, T* ) {} };


for ( auto el: elements )
{
    for ( auto vis: visitors )
    {
        for ( auto th: thirds )
        {
            std::visit( funcs{ Actions(), SpecialActionForElement1Derived1()}, el->GetVariant(), vis->GetVariant(), th->GetVariant() );
        }
    }
}

As said, std::__detail::__variant::__gen_vtable_impl<...> takes the most amount of time.

Q: As the generated n-dimensional function array generated at each visit call is from call to call the same, it would be nice to keep it between the calls of std::visit. Is that possible?

Maybe I am on the wrong path, if so, let me know!

EDIT: used compiler gcc7.3 from standard fedora installation. std-lib is used as standard in g++ ( what ever this is )

build options:

g++ --std=c++17 -fno-rtti main.cpp -O3 -g -o go
like image 621
Klaus Avatar asked Mar 07 '18 07:03

Klaus


1 Answers

I just had a look at a simpler example. The table is generated at compile time. The time is probably spent in lambdas generated in std::__detail::__variant::__gen_vtable_impl<...>. For some reason these lambdas which basically call the visitor do not omit the check for the actual type of the variant.

This function lets the compiler create code for four different versions of the visiting lambda inlined into lambdas created deep down in std::visit and stores the pointers to these lambdas in a static array:

double test(std::variant<int, double> v1, std::variant<int, double> v2) {
    return std::visit([](auto a, auto b) -> double {
        return a + b;
        }, v1, v2);
}

This is created in test:

  (...) ; load variant tags and check for bad variant
  lea rax, [rcx+rax*2] ; compute index in array
  mov rdx, rsi
  mov rsi, rdi
  lea rdi, [rsp+15]
  ; index into vtable with rax
  call [QWORD PTR std::__detail::__variant::(... bla lambda bla ...)::S_vtable[0+rax*8]]

This is generated for the <double, double> visitor:

std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<double (*)(test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, std::variant<int, double>&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&)>, std::tuple<test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&>, std::integer_sequence<unsigned long, 1ul, 1ul> >::__visit_invoke(test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&):
; whew, that is a long name :-)
; redundant checks are performed whether we are accessing variants of the correct type:
      cmp BYTE PTR [rdx+8], 1
      jne .L15
      cmp BYTE PTR [rsi+8], 1
      jne .L15
; the actual computation:
      movsd xmm0, QWORD PTR [rsi]
      addsd xmm0, QWORD PTR [rdx]
      ret

I would not be surprised if the profiler attributed both the time for these type checks and the time of your inlined visitors to std::__detail::__variant::__gen_vtable_impl<...>, rather than giving you the full 800-plus character name of the deeply nested lambda.

The only generic optimization potential I see here would be to omit the checks for bad variant in the lambdas. Since the lambdas are called through a function pointer only with matching variants, the compiler will have a very hard time statically discovering that the checks are redundant.

I had a look at the same example compiled with clang and libc++. In libc++ the redundant type checks are eliminated, so libstdc++ is not quite optimal yet.

decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul, 1ul>::__dispatch<std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&): # @"decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul, 1ul>::__dispatch<std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&)"
  ; no redundant check here
  movsd xmm0, qword ptr [rsi] # xmm0 = mem[0],zero
  addsd xmm0, qword ptr [rdx]
  ret

Maybe you can check what code is actually generated in your production software, just in case it is not similar to what I found with my example.

like image 170
PaulR Avatar answered Oct 07 '22 14:10

PaulR