Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is pmr::string so slow in these benchmarks?

Trying out the example in Section 5.9.2 Class monotonic_buffer_resource of the following article on Polymorphic Memory Resources by Pablo Halpern :

Doc No: N3816
Date: 2013-10-13
Author: Pablo Halpern
[email protected]
Polymorphic Memory Resources - r1
(Originally N3525 – Polymorphic Allocators)

The article claims that :

The monotonic_buffer_resource class is designed for very fast memory allocations in situations where memory is used to build up a few objects and then is released all at once when those objects go out of scope.

and that :

A particularly good use for a monotonic_buffer_resource is to provide memory for a local variable of container or string type. For example, the following code concatenates two strings, looks for the word “hello” in the concatenated string, and then discards the concatenated string after the word is found or not found. The concatenated string is expected to be no more than 80 bytes long, so the code is optimized for these short strings using a small monotonic_buffer_resource [...]

I've benchmarked the example using the google benchmark library and boost.container 1.69's polymorphic resources, compiled and linked to release binaries with g++-8 on an Ubuntu 18.04 LTS hyper-v virtual machine with the following code :

// overload using pmr::string
static bool find_hello(const boost::container::pmr::string& s1, const boost::container::pmr::string& s2)
{
    using namespace boost::container;

    char buffer[80];
    pmr::monotonic_buffer_resource m(buffer, 80);
    pmr::string s(&m);
    s.reserve(s1.length() + s2.length());
    s += s1;
    s += s2;
    return s.find("hello") != pmr::string::npos;
}

// overload using std::string
static bool find_hello(const std::string& s1, const std::string& s2)
{
    std::string s{};
    s.reserve(s1.length() + s2.length());
    s += s1;
    s += s2;
    return s.find("hello") != std::string::npos;
}

static void allocator_local_string(::benchmark::State& state)
{
    CLEAR_CACHE(2 << 12);

    using namespace boost::container;
    pmr::string s1(35, 'c'), s2(37, 'd');

    for (auto _ : state)
    {
        ::benchmark::DoNotOptimize(find_hello(s1, s2));
    }
}

// pmr::string with monotonic buffer resource benchmark registration
BENCHMARK(allocator_local_string)->Repetitions(5);

static void allocator_global_string(::benchmark::State& state)
{
    CLEAR_CACHE(2 << 12);

    std::string s1(35, 'c'), s2(37, 'd');

    for (auto _ : state) 
    {
        ::benchmark::DoNotOptimize(find_hello(s1, s2));
    }
}

// std::string using std::allocator and global allocator benchmark registration
BENCHMARK(allocator_global_string)->Repetitions(5);

Here are the results :
Benchmark Results

How is the pmr::string benchmark so slow compared to std::string?

I assume std::string's std::allocator should use "new" on the reserve call, and construct each character afterwards when calling :

s += s1; 
s += s2

Comparing that to a pmr::string using a polymorphic allocator that holds the monotonic_buffer_resource, reserving memory should boil down to simply pointer arithmetic, necessitating no "new" as the char buffer should be sufficient. Subsequently, it would construct each character as std::string does.

So, considering that the only differing operations between the pmr::string version of find_hello and the std::string version of find_hello is the call to reserve memory, with pmr::string using stack allocation and std::string using heap allocation :

  • Is my benchmark wrong?
  • Is my interpretation of how allocation should occur wrong?
  • Why is the pmr::string benchmark approximately 5 times slower than the std::string benchmark?
like image 433
Quoc-Minh Avatar asked Dec 18 '22 18:12

Quoc-Minh


1 Answers

There is a combination of things that makes boost pmr::basic_string slower:

  1. Construction of the pmr::monotonic_buffer_resource has some cost (17 nano-seconds here).
  2. pmr::basic_string::reserve reserves more than one requires. It reserves 96 bytes in this case, which is more than the 80 bytes you have.
  3. Reserving in pmr::basic_string is not free, even when the buffer is big enough (extra 8 nano-seconds here).
  4. The concatenation of strings is costly (extra 64 ns here).
  5. pmr::basic_string::find has a suboptimal implementation. This is the real cost for the poor speed. In GCC's std::basic_string::find uses __builtin_memchr to find the first character that might match, which boost is doing it all in one big loop. Apparently this is the main cost, and what makes boost run slower than std.

So, after increasing the buffer, and comparing boost::container::string with boost::container::pmr::string, the pmr version comes slightly slower (293 ns vs. 276 ns). This is because new and delete are actually quite fast for such micro-benchmarks, and are faster than the complicated machinery of the pmr (just 17 ns for construction). In fact, the default Linux/gcc new/delete reuse the same pointer again and again. This optimization has a very simple and fast implementation, that also works great with the CPU cache.

As a proof, try this out (without optimization):

for (int i=0 ; i < 10 ; ++i)
{
  char * ptr = new char[96];
  std::cout << (void*) ptr << '\n';
  delete[] ptr;
}

This prints the same pointer again and again.

The theory is that in a real program, where new/delete don't behave that nicely, and can't reuse the same block again and again, then new/delete slow down the execution much more and cache locality becomes quite poor. In such case the pmr+buffer are worth it.

Conclusion: the implementation of boost pmr string is slower than gcc's string. The pmr machinery is slightly more costly than the default and simple scenario of the new/delete.

like image 162
Michael Veksler Avatar answered Jan 03 '23 11:01

Michael Veksler