In some performance experiments, switching from std::u16string to char16_t arrays sometimes boosted performance by 40-50%. My expectation was that accessing a std::u16string, with no copying or allocations involved, should be just as fast as accessing a C array; however, my benchmarks show the opposite.
Here is the code I've written for the benchmark (it uses the Google Benchmark library):
#include "benchmark/benchmark.h"
#include <string>
static std::u16string str;
static char16_t *str2;
static void BM_Strings(benchmark::State &state) {
while (state.KeepRunning()) {
for (size_t i = 0; i < str.size(); i++){
benchmark::DoNotOptimize(str[i]);
}
}
}
static void BM_CharArray(benchmark::State &state) {
while (state.KeepRunning()) {
for (size_t i = 0; i < str.size(); i++){
benchmark::DoNotOptimize(str2[i]);
}
}
}
BENCHMARK(BM_Strings);
BENCHMARK(BM_CharArray);
static void init(){
str = u"Various applications of randomness have led to the development of several different methods ";
str2 = (char16_t *) str.c_str();
}
int main(int argc, char** argv) {
init();
::benchmark::Initialize(&argc, argv);
::benchmark::RunSpecifiedBenchmarks();
}
It shows the following result:
Run on (8 X 2200 MHz CPU s)
2017-07-11 23:05:57
Benchmark           Time           CPU   Iterations
---------------------------------------------------
BM_Strings       1832 ns       1830 ns       365938
BM_CharArray      928 ns        926 ns       712577
I'm using clang (Apple LLVM version 8.1.0, clang-802.0.42) on macOS. With optimizations turned on, the gap is smaller but still noticeable:
Benchmark           Time           CPU   Iterations
---------------------------------------------------
BM_Strings        242 ns        241 ns      2906615
BM_CharArray      161 ns        161 ns      4552165
Can someone explain what's going on here and why there is a difference?
Update (after mixing the order and adding a few warm-up steps; one possible way to do this is sketched after the table):
Benchmark           Time           CPU   Iterations
---------------------------------------------------
BM_CharArray      670 ns        665 ns       903168
BM_Strings        856 ns        854 ns       817776
BM_CharArray      166 ns        166 ns      4369997
BM_Strings        225 ns        225 ns      3149521
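One possible way to produce that interleaved, warmed-up run order (this is an assumption about the setup, not the code actually used) is to register each benchmark twice at runtime with benchmark::RegisterBenchmark:

// Hypothetical registration, not the original code: run each benchmark twice
// so that the first pair acts as a warm-up pass for the second.
// (Assumes the static BENCHMARK() registrations above are removed.)
int main(int argc, char** argv) {
    init();
    ::benchmark::Initialize(&argc, argv);
    benchmark::RegisterBenchmark("BM_CharArray", BM_CharArray);
    benchmark::RegisterBenchmark("BM_Strings", BM_Strings);
    benchmark::RegisterBenchmark("BM_CharArray", BM_CharArray);
    benchmark::RegisterBenchmark("BM_Strings", BM_Strings);
    ::benchmark::RunSpecifiedBenchmarks();
}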
Also, here are the compile flags I'm using:
/usr/bin/clang++ -I{some includes here} -O3 -std=c++14 -stdlib=libc++ -Wall -Wextra -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk -O3 -fsanitize=address -Werror -o CMakeFiles/BenchmarkString.dir/BenchmarkString.cpp.o -c test/benchmarks/BenchmarkString.cpp
Interestingly, I am unable to reproduce your results; I can barely detect a difference between the two.
The (incomplete) code I used is shown here:
hol::StdTimer timer;

using index_type = std::size_t;

index_type const N = 100'000'000;
index_type const SIZE = 1024;

static std::u16string s16;
static char16_t const* p16;

int main(int, char** argv)
{
    std::generate_n(std::back_inserter(s16), SIZE,
        []{ return (char)hol::random_number((int)'A', (int)'Z'); });

    p16 = s16.c_str();

    unsigned sum;

    {
        sum = 0;
        timer.start();
        for(index_type n = 0; n < N; ++n)
            for(index_type i = 0; i < SIZE; ++i)
                sum += s16[i];
        timer.stop();
        RESULT("string", sum, timer);
    }

    {
        sum = 0;
        timer.start();
        for(std::size_t n = 0; n < N; ++n)
            for(std::size_t i = 0; i < SIZE; ++i)
                sum += p16[i];
        timer.stop();
        RESULT("array ", sum, timer);
    }
}
Output:
string: (670240768) 17.575232 secs
array : (670240768) 17.546145 secs
Compiler:
GCC 7.1
g++ -std=c++14 -march=native -O3 -D NDEBUG
Because of the way libc++ implements the small string optimization, every dereference needs to check whether the string contents are stored in the string object itself or on the heap. Because the indexing expression is wrapped in benchmark::DoNotOptimize, the compiler cannot hoist that check out of the loop, so it is repeated every time a character is accessed. When accessing the string data through a raw pointer, the data is always external, and so requires no check.
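To illustrate (this is my own sketch, not part of the original question or answer): resolving the data pointer once before the inner loop performs the SSO check a single time per outer iteration, after which the string benchmark should behave like the raw-array one. The benchmark name below is hypothetical.

// Hypothetical variant of the string benchmark: cache the data pointer so the
// SSO "internal buffer or heap?" branch is taken once, not on every access.
static void BM_StringsCachedPtr(benchmark::State &state) {
    while (state.KeepRunning()) {
        const char16_t *p = str.data(); // the SSO check happens here, once
        const size_t n = str.size();
        for (size_t i = 0; i < n; i++) {
            benchmark::DoNotOptimize(p[i]); // plain pointer indexing, no branch
        }
    }
}
BENCHMARK(BM_StringsCachedPtr);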