I have reached a bottleneck in my code, so the main concern of this question is performance.
I have a hexadecimal checksum and I want to check the leading zeros of an array of chars. This is what I am doing:
bool starts_with (char* cksum_hex, int n_zero) {
    bool flag {true};
    for (int i = 0; i < n_zero; ++i)
        flag &= (cksum_hex[i] == '0');
    return flag;
}
The above function returns true if cksum_hex has n_zero leading zeros. However, for my application this function is very expensive (60% of total time); in other words, it is the bottleneck of my code, so I need to improve it.
I also checked std::string::starts_with, which is available in C++20, and I observed no difference in performance:

// I have to convert cksum to string
std::string cksum_hex_s (cksum_hex);
cksum_hex_s.starts_with("000");   // checking for 3 leading zeros
For more information: I am compiling with g++ -O3 -std=c++2a, and my gcc version is 9.3.1.
Regarding std::string::starts_with: the character array approach remains significantly faster, although less so; in these tests it was about 29% faster.
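For completeness: std::string_view also gained starts_with in C++20 and wraps the existing buffer, so it avoids the std::string copy shown in the question. A minimal sketch (the wrapper name is mine):

#include <string_view>

bool starts_with_sv (const char* cksum_hex) {
    // No allocation: string_view refers to the existing characters
    // (constructing it still scans for the terminating '\0').
    return std::string_view(cksum_hex).starts_with("000");   // 3 leading zeros
}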
If you modify your function to return early:

bool starts_with (char* cksum_hex, int n_zero) {
    for (int i = 0; i < n_zero; ++i) {
        if (cksum_hex[i] != '0') return false;
    }
    return true;
}
it will be faster when n_zero is large and the result is false. Otherwise, you can try allocating a global array of '0' characters and using std::memcmp:

#include <cstring>

// make it as big as you need
constexpr char cmp_array[4] = {'0', '0', '0', '0'};

bool starts_with (char* cksum_hex, int n_zero) {
    return std::memcmp(cksum_hex, cmp_array, n_zero) == 0;
}

The problem here is that you need to assume some maximum possible value of n_zero.
Live example
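If assuming a maximum n_zero is undesirable, a standard-library alternative with no fixed limit is possible; a minimal sketch, not benchmarked here, with a name of my own:

#include <algorithm>

bool starts_with_all_of (const char* cksum_hex, int n_zero) {
    // Stops at the first mismatch like the early-return loop;
    // compilers can often vectorize this pattern at -O3.
    return std::all_of(cksum_hex, cksum_hex + n_zero,
                       [](char c) { return c == '0'; });
}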
=== EDIT ===
Considering the complaints about the lack of profiling data to justify the suggested approaches, here you go:

- benchmark of the memcmp implementation
- benchmark of the memcmp implementation against the OP's original implementation

Data used:
const char* cs1 = "00000hsfhjshjshgj";
const char* cs2 = "20000hsfhjshjshgj";
const char* cs3 = "0000000000hsfhjshjshgj";
const char* cs4 = "0000100000hsfhjshjshgj";
memcmp is fastest in all cases except cs2, where the early-return implementation wins.
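If you want to reproduce these numbers locally, a minimal timing harness along these lines is enough (this is a sketch of my own, not the benchmark code used above; time_it is a hypothetical name):

#include <chrono>
#include <cstdio>

// Hypothetical harness: times repeated calls to a predicate f on one input.
template <class F>
void time_it (const char* label, F f, const char* s, int n_zero) {
    volatile bool sink = false;   // discourage optimizing the calls away
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000; ++i)
        sink = f(s, n_zero);
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%s: %lld ms\n", label, (long long)ms);
    (void)sink;
}

Call it once per implementation on each of the four test strings and compare the totals.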
Presumably you also have the binary checksum? Instead of converting it to ASCII text first, look at the 4*n high bits to check n nibbles directly for 0, rather than checking n bytes for equality to '0'.
e.g. if you have the hash (or the high 8 bytes of it) as a uint64_t or unsigned __int128, right-shift it to keep only the high n nibbles.
I showed some examples of how they compile for x86-64 when both inputs are runtime variables, but these also compile nicely to other ISAs like AArch64. This code is all portable ISO C++.
#include <cstdint>

bool high_zero_nibbles (uint64_t cksum_high8, int n_zero) {
    int shift = 64 - n_zero * 4;         // A hex digit represents a 4-bit nibble
    return (cksum_high8 >> shift) == 0;
}
clang does a nice job for x86-64 with -O3 -march=haswell to enable BMI1/BMI2:

high_zero_nibbles(unsigned long, int):
        shl     esi, 2
        neg     sil                  # x86 shifts wrap the count so 64 - c is the same as -c
        shrx    rax, rdi, rsi        # BMI2 variable-count shifts save some uops.
        test    rax, rax
        sete    al
        ret
This even works for n=16 (shift=0) to test all 64 bits. It fails for n_zero = 0 (testing none of the bits): it would encounter UB by shifting a uint64_t by a shift count >= its width. (On ISAs like x86 that wrap out-of-range shift counts, code-gen that worked for other shift counts would end up checking all 16 nibbles. As long as the UB wasn't visible at compile time...) Hopefully you're not planning to call this with n_zero = 0 anyway.
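If n_zero = 0 (or anything outside 1..16) can actually occur, a guarded wrapper avoids the UB; a minimal sketch, with a function name of my own:

#include <cstdint>

bool high_zero_nibbles_safe (uint64_t cksum_high8, int n_zero) {
    if (n_zero <= 0) return true;               // zero leading zeros: trivially true
    if (n_zero >= 16) return cksum_high8 == 0;  // the whole value must be zero
    int shift = 64 - n_zero * 4;
    return (cksum_high8 >> shift) == 0;         // shift is now in 4..60: well-defined
}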
Other options: create a mask that keeps only the high n*4 bits, perhaps shortening the critical path through cksum_high8 if it's ready later than n_zero. Especially if n_zero is a compile-time constant after inlining, this can be as fast as checking cksum_high8 == 0. (e.g. x86-64 test reg, immediate.)
bool high_zero_nibbles_v2 (uint64_t cksum_high8, int n_zero) {
    int shift = 64 - n_zero * 4;                // A hex digit represents a 4-bit nibble
    uint64_t low4n_mask = (1ULL << shift) - 1;
    return (cksum_high8 & ~low4n_mask) == 0;    // true when the high n_zero nibbles are all 0
}
Or use a bit-scan function to count leading zero bits and compare for >= 4*n. Unfortunately it took ISO C++ until C++20's <bit> header with countl_zero to finally portably expose this common CPU feature that's been around for decades (e.g. 386 bsf / bsr); before that it was available only as compiler extensions like GNU C's __builtin_clz.
This is great if you want to know how many leading zeros there are, rather than testing against one specific cutoff threshold.
bool high_zero_nibbles_lzcnt (uint64_t cksum_high8, int n_zero) {
    // UB on cksum_high8 == 0. Use x86-64 BMI1 _lzcnt_u64 to avoid that, guaranteeing 64 on input=0.
    return __builtin_clzll(cksum_high8) >= 4*n_zero;
}

#include <bit>
bool high_zero_nibbles_stdlzcnt (uint64_t cksum_high8, int n_zero) {
    return std::countl_zero(cksum_high8) >= 4*n_zero;
}
These compile to (clang for Haswell):
high_zero_nibbles_lzcnt(unsigned long, int):
        lzcnt   rax, rdi
        shl     esi, 2
        cmp     esi, eax
        setle   al                   # FLAGS -> boolean integer return value
        ret
All these instructions are cheap on Intel and AMD, and there's even some instruction-level parallelism between lzcnt and shl.
See asm output for all 4 of these on the Godbolt compiler explorer. Clang compiles 1 and 2 to identical asm, and likewise for both lzcnt versions with -march=haswell. Otherwise it has to go out of its way to handle the bsr corner case for input=0, for the C++20 version where that's not UB.
To extend these to wider hashes, you can check the high uint64_t for being all-zero, then proceed to the next uint64_t chunk.
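A minimal sketch of that chunked extension (my function name; assumes chunks[0] holds the most-significant 64 bits, i.e. the first 16 hex digits):

#include <cstdint>

bool high_zero_nibbles_wide (const uint64_t* chunks, int n_chunks, int n_zero) {
    // Consume whole all-zero chunks first: one uint64_t covers 16 nibbles.
    while (n_zero >= 16) {
        if (n_chunks == 0) return false;     // ran out of data before enough zeros
        if (*chunks != 0) return false;
        ++chunks; --n_chunks; n_zero -= 16;
    }
    if (n_zero == 0) return true;
    if (n_chunks == 0) return false;
    int shift = 64 - n_zero * 4;             // check the remaining high nibbles
    return (*chunks >> shift) == 0;
}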
Using an SSE2 compare with pcmpeqb on the string, then pmovmskb -> bsf, could find the position of the first 1 bit in the mask, and thus how many leading-'0' characters there were in the string representation, if you have that to start with. So x86 SIMD can do this very efficiently, and you can use that from C++ via intrinsics.
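A sketch of that idea with intrinsics (my function name; it assumes at least 16 readable bytes at s, and inverts the compare mask so bsf/tzcnt finds the first non-'0' character):

#include <emmintrin.h>   // SSE2 intrinsics

int leading_zero_chars_sse2 (const char* s) {
    __m128i v     = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
    __m128i zeros = _mm_set1_epi8('0');
    // pcmpeqb: 0xFF where s[i] == '0'; pmovmskb packs those bytes to a 16-bit mask.
    unsigned eq = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zeros));
    unsigned ne = ~eq & 0xFFFFu;     // 1-bits where s[i] != '0'
    if (ne == 0) return 16;          // all 16 chars are '0'
    return __builtin_ctz(ne);        // bsf/tzcnt: index of first non-'0' char
}

For n_zero <= 16, a starts-with test then becomes leading_zero_chars_sse2(cksum_hex) >= n_zero.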