In some performance experiments, switching from std::u16string to char16_t arrays sometimes boosted performance by 40-50%. My expectation was that accessing a std::u16string, with no copying or allocations involved, should be just as fast as accessing a C array; however, my benchmarks show the opposite.
Here is the code I've written for the benchmark (it uses the Google Benchmark library):
#include "benchmark/benchmark.h"
#include <string>
static std::u16string str;
static char16_t *str2;
static void BM_Strings(benchmark::State &state) {
while (state.KeepRunning()) {
for (size_t i = 0; i < str.size(); i++){
benchmark::DoNotOptimize(str[i]);
}
}
}
static void BM_CharArray(benchmark::State &state) {
while (state.KeepRunning()) {
for (size_t i = 0; i < str.size(); i++){
benchmark::DoNotOptimize(str2[i]);
}
}
}
BENCHMARK(BM_Strings);
BENCHMARK(BM_CharArray);
static void init(){
str = u"Various applications of randomness have led to the development of several different methods ";
str2 = (char16_t *) str.c_str();
}
int main(int argc, char** argv) {
init();
::benchmark::Initialize(&argc, argv);
::benchmark::RunSpecifiedBenchmarks();
}
It shows the following result:
Run on (8 X 2200 MHz CPU s)
2017-07-11 23:05:57
Benchmark           Time           CPU   Iterations
---------------------------------------------------
BM_Strings       1832 ns       1830 ns       365938
BM_CharArray      928 ns        926 ns       712577
I'm using clang (Apple LLVM version 8.1.0, clang-802.0.42) on macOS. With optimizations turned on, the gap is smaller but still noticeable:
Benchmark           Time           CPU   Iterations
---------------------------------------------------
BM_Strings        242 ns        241 ns      2906615
BM_CharArray      161 ns        161 ns      4552165
Can someone explain what's going on here and why there is a difference?
Update (after mixing the order and adding a few warm-up steps; one possible way to do this is sketched after the table):
Benchmark           Time           CPU   Iterations
---------------------------------------------------
BM_CharArray      670 ns        665 ns       903168
BM_Strings        856 ns        854 ns       817776
BM_CharArray      166 ns        166 ns      4369997
BM_Strings        225 ns        225 ns      3149521
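One possible way to produce that interleaved, warmed-up run order (this is an assumption about the setup, not the code actually used) is to register each benchmark twice at runtime with benchmark::RegisterBenchmark:

// Hypothetical registration, not the original code: run each benchmark twice
// so that the first pair acts as a warm-up pass for the second.
// (Assumes the static BENCHMARK() registrations above are removed.)
int main(int argc, char** argv) {
    init();
    ::benchmark::Initialize(&argc, argv);
    benchmark::RegisterBenchmark("BM_CharArray", BM_CharArray);
    benchmark::RegisterBenchmark("BM_Strings", BM_Strings);
    benchmark::RegisterBenchmark("BM_CharArray", BM_CharArray);
    benchmark::RegisterBenchmark("BM_Strings", BM_Strings);
    ::benchmark::RunSpecifiedBenchmarks();
}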
Also, here are the compile flags I'm using:
/usr/bin/clang++ -I{some includes here} -O3 -std=c++14 -stdlib=libc++ -Wall -Wextra -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk -O3 -fsanitize=address -Werror -o CMakeFiles/BenchmarkString.dir/BenchmarkString.cpp.o -c test/benchmarks/BenchmarkString.cpp
Interestingly, I am unable to reproduce your results; I can barely detect a difference between the two.
The (incomplete) code I used is shown here:
hol::StdTimer timer;

using index_type = std::size_t;

index_type const N = 100'000'000;
index_type const SIZE = 1024;

static std::u16string s16;
static char16_t const* p16;

int main(int, char** argv)
{
    std::generate_n(std::back_inserter(s16), SIZE,
        []{ return (char)hol::random_number((int)'A', (int)'Z'); });

    p16 = s16.c_str();

    unsigned sum;

    {
        sum = 0;
        timer.start();
        for(index_type n = 0; n < N; ++n)
            for(index_type i = 0; i < SIZE; ++i)
                sum += s16[i];
        timer.stop();
        RESULT("string", sum, timer);
    }

    {
        sum = 0;
        timer.start();
        for(std::size_t n = 0; n < N; ++n)
            for(std::size_t i = 0; i < SIZE; ++i)
                sum += p16[i];
        timer.stop();
        RESULT("array ", sum, timer);
    }
}
Output:
string: (670240768) 17.575232 secs
array : (670240768) 17.546145 secs
Compiler:
GCC 7.1
g++ -std=c++14 -march=native -O3 -D NDEBUG
Because of the way libc++ implements the small string optimization, every dereference needs to check whether the string contents are stored in the string object itself or on the heap. Because the indexing expression is wrapped in benchmark::DoNotOptimize, the compiler cannot hoist that check out of the loop, so it is repeated every time a character is accessed. When accessing the string data through a raw pointer, the data is always external, and so requires no check.
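To illustrate (this is my own sketch, not part of the original question or answer): resolving the data pointer once before the inner loop performs the SSO check a single time per outer iteration, after which the string benchmark should behave like the raw-array one. The benchmark name below is hypothetical.

// Hypothetical variant of the string benchmark: cache the data pointer so the
// SSO "internal buffer or heap?" branch is taken once, not on every access.
static void BM_StringsCachedPtr(benchmark::State &state) {
    while (state.KeepRunning()) {
        const char16_t *p = str.data(); // the SSO check happens here, once
        const size_t n = str.size();
        for (size_t i = 0; i < n; i++) {
            benchmark::DoNotOptimize(p[i]); // plain pointer indexing, no branch
        }
    }
}
BENCHMARK(BM_StringsCachedPtr);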