I'm trying to convert a hex char
to integer as fast as possible.
This is only one line:
int x = atoi(hex.c_str());
Is there a faster way?
Here, I have tried a more dynamic approach, and it's slightly faster.
int hextoint(char number) {
    if (number == '0') {
        return 0;
    }
    if (number == '1') {
        return 1;
    }
    if (number == '2') {
        return 2;
    }
    /*
     * 3 through 8
     */
    if (number == '9') {
        return 9;
    }
    if (number == 'a') {
        return 10;
    }
    if (number == 'b') {
        return 11;
    }
    if (number == 'c') {
        return 12;
    }
    if (number == 'd') {
        return 13;
    }
    if (number == 'e') {
        return 14;
    }
    if (number == 'f') {
        return 15;
    }
    return -1;
}
The atoi() and atol() functions convert a character string containing only decimal integer constants, but the strtol() and strtoul() functions can convert a character string containing an integer constant in octal, decimal, hexadecimal, or a base specified by the base parameter.
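For multi-character hex strings, strtol with base 16 is the standard-library route (a minimal sketch; the helper name is mine):

```cpp
#include <cstdlib>

// strtol parses an integer in the given base; base 16 accepts
// hex digits in either case and an optional "0x" prefix.
long hex_string_to_long(const char* s) {
    return std::strtol(s, nullptr, 16);
}
```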
Well, that's a weird question. Converting a single hex char into an integer is so fast that it is really hard to tell which method is faster, because all of them are almost certainly faster than the code you write in order to call them =)
I'll assume that the input character is already loaded into a register (e.g. eax). Now, here are several methods for solving the problem: the first one is based on a lookup table, two are based on the ternary operator, and the last one on bit operations:
int hextoint_lut(char x) {
    static char lut[256] = {???};
    return lut[uint8_t(x)];
}
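The ??? initializer is left to the reader; one way to fill the table is to build it once at startup instead of spelling out the literal (a sketch; the function name and the lazy, non-thread-safe initialization are my choices, and it assumes char is signed as the original does):

```cpp
#include <cstdint>
#include <cstring>

// Build the 256-entry table at first use: -1 for invalid bytes,
// 0..15 for '0'..'9' and 'a'..'f'.
int hextoint_lut2(char x) {
    static char lut[256];
    static bool init = false;
    if (!init) {                       // not thread-safe; for illustration
        std::memset(lut, -1, sizeof(lut));
        for (char c = '0'; c <= '9'; ++c) lut[uint8_t(c)] = c - '0';
        for (char c = 'a'; c <= 'f'; ++c) lut[uint8_t(c)] = c - 'a' + 10;
        init = true;
    }
    return lut[uint8_t(x)];
}
```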
int hextoint_cond(char x) {
    uint32_t dig = x - '0';
    uint32_t alp = dig + ('0' - 'a' + 10);
    return dig <= 9U ? dig : alp;
}

int hextoint_cond2(char x) {
    uint32_t offset = (uint8_t(x) <= uint8_t('9') ? '0' : 'a' - 10);
    return uint8_t(x) - offset;
}

int hextoint_bit(char x) {
    int b = uint8_t(x);
    int mask = (('9' - b) >> 31);
    int offset = '0' + (mask & int('a' - '0' - 10));
    return b - offset;
}
Here are the corresponding assembly listings generated (only the relevant parts are shown):
;hextoint_lut;
movsx eax, BYTE PTR [rax+rcx] ; just load the byte =)
;hextoint_cond;
sub edx, 48 ; subtract '0'
cmp edx, 9 ; compare to '9'
lea eax, DWORD PTR [rdx-39] ; add ('0' - 'a' + 10)
cmovbe eax, edx ; choose between two cases in branchless way
;hextoint_cond2; ; (modified slightly)
mov eax, 48
mov edx, 87 ; set two offsets to registers
cmp ecx, 57 ; compare with '9'
cmovbe edx, eax ; choose one offset
sub ecx, edx ; subtract the offset
;hextoint_bit;
mov ecx, 57 ; load '9'
sub ecx, eax ; get '9' - x
sar ecx, 31 ; convert to mask if negative
and ecx, 39 ; set to 39 (for x > '9')
sub eax, ecx ; subtract 39 or 0
sub eax, 48 ; subtract '0'
I'll try to estimate the number of cycles taken by each approach in the throughput sense, which is essentially the time spent per input number when many numbers are processed at once. Consider the Sandy Bridge architecture as an example.
The hextoint_lut function consists of a single memory load, which takes 1 uop on port 2 or 3. Both of these ports are dedicated to memory loads, and they also have address calculation inside, capable of doing rax+rcx at no additional cost. There are two such ports, each able to do one uop per cycle. So supposedly this version would take 0.5 cycles. If we have to load the input number from memory, that would require one more memory load per value, so the total cost would be 1 cycle.
The hextoint_cond version has 4 instructions, but cmov is broken into two separate uops. So there are 5 uops in total, each of which can be processed on any of the three arithmetic ports 0, 1, and 5. So supposedly it would take 5/3 cycles. Note that the memory load ports are free, so the time would not increase even if you have to load the input value from memory.
The hextoint_cond2 version has 5 instructions. But in a tight loop the constants can be preloaded into registers, leaving only the comparison, cmov and subtraction. That is 4 uops in total, giving 4/3 cycles per value (even with a memory read).
The hextoint_bit version is guaranteed to have no branches and no lookup, which is handy if you do not want to keep checking whether your compiler generated a cmov instruction. The first mov is free, since the constant can be preloaded in a tight loop. The rest are 5 arithmetic instructions, i.e. 5 uops on ports 0, 1, 5. So it should take 5/3 cycles (even with a memory read).
I have performed a benchmark for the C++ functions described above. In the benchmark, 64 KB of random data is generated, then each function is run many times on this data. All the results are added to a checksum to ensure that the compiler does not remove the code. Manual 8x unrolling is used. I have tested on an Ivy Bridge 3.4 GHz core, which is very similar to Sandy Bridge. Each line of output contains: name of the function, total time taken by the benchmark, number of cycles per input value, and the sum of all outputs.
Benchmark code
MSVC2013 x64 /O2:
hextoint_lut: 0.741 sec, 1.2 cycles (check: -1022918656)
hextoint_cond: 1.925 sec, 3.0 cycles (check: -1022918656)
hextoint_cond2: 1.660 sec, 2.6 cycles (check: -1022918656)
hextoint_bit: 1.400 sec, 2.2 cycles (check: -1022918656)
GCC 4.8.3 x64 -O3 -fno-tree-vectorize
hextoint_lut: 0.702 sec, 1.1 cycles (check: -1114112000)
hextoint_cond: 1.513 sec, 2.4 cycles (check: -1114112000)
hextoint_cond2: 2.543 sec, 4.0 cycles (check: -1114112000)
hextoint_bit: 1.544 sec, 2.4 cycles (check: -1114112000)
GCC 4.8.3 x64 -O3
hextoint_lut: 0.702 sec, 1.1 cycles (check: -1114112000)
hextoint_cond: 0.717 sec, 1.1 cycles (check: -1114112000)
hextoint_cond2: 0.468 sec, 0.7 cycles (check: -1114112000)
hextoint_bit: 0.577 sec, 0.9 cycles (check: -1114112000)
Clearly, the LUT approach takes one cycle per value (as predicted). The other approaches normally take from 2.2 to 2.6 cycles per value. In the case of GCC, hextoint_cond2 is slow because the compiler uses cmp+sbb+and magic instead of the desired cmov instructions. Also note that by default GCC vectorizes most of the approaches (the last set of results), which, as expected, gives faster results than the unvectorizable LUT approach. Note that manual vectorization would give a significantly greater boost.
Note that hextoint_cond with an ordinary conditional jump instead of cmov would have a branch. Assuming random input hex digits, it will be mispredicted almost always. So performance would be terrible, I think.
I have analysed throughput performance. But if we have to process tons of input values, then we should definitely vectorize the conversion to get better speed. hextoint_cond can be vectorized with SSE in a pretty straightforward way: it can process 16 input bytes at once using only a handful of instructions, taking about 2 cycles, I suppose.
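A sketch of such a vectorized hextoint_cond using SSE2 intrinsics (the function name is mine; input bytes are assumed to be valid lowercase hex digits; with SSE4.1, _mm_blendv_epi8 could replace the and/andnot/or select):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Convert 16 ASCII hex chars to their values 0..15 in parallel.
__m128i hextoint_sse(__m128i x) {
    __m128i dig   = _mm_sub_epi8(x, _mm_set1_epi8('0'));               // x - '0'
    __m128i alp   = _mm_sub_epi8(dig, _mm_set1_epi8('a' - '0' - 10));  // x - 'a' + 10
    __m128i isdig = _mm_cmplt_epi8(dig, _mm_set1_epi8(10));            // 0xFF where digit
    // branchless per-byte select: dig where isdig, else alp
    return _mm_or_si128(_mm_and_si128(isdig, dig),
                        _mm_andnot_si128(isdig, alp));
}
```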
Note that in order to see any performance difference, you must ensure that all the input values fit into cache (L1 is the best case). If you read the input data from main memory, even std::atoi is as fast as the methods considered here =)
Also, you should unroll your main loop 4x or even 8x for maximum performance (to remove looping overhead). As you might have already noticed, the speed of both methods depends heavily on which operations surround the code. E.g. adding a memory load doubles the time taken by the first approach, but does not influence the other approaches.
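A manually 8x-unrolled summing loop of the kind used in the benchmark might look like this (a sketch; hextoint_cond is repeated so the snippet is self-contained, and the driver name is mine):

```cpp
#include <cstdint>
#include <cstddef>

static int hextoint_cond(char x) {
    uint32_t dig = uint32_t(uint8_t(x)) - '0';
    uint32_t alp = dig + ('0' - 'a' + 10);
    return dig <= 9U ? int(dig) : int(alp);
}

// Sum the values of n hex chars; n is assumed to be a multiple of 8.
int64_t sum_hex_8x(const char* buf, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i += 8) {
        // eight independent conversions per iteration remove loop overhead
        sum += hextoint_cond(buf[i + 0]) + hextoint_cond(buf[i + 1])
             + hextoint_cond(buf[i + 2]) + hextoint_cond(buf[i + 3])
             + hextoint_cond(buf[i + 4]) + hextoint_cond(buf[i + 5])
             + hextoint_cond(buf[i + 6]) + hextoint_cond(buf[i + 7]);
    }
    return sum;
}
```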
P.S. Most likely you don't really need to optimize this.
This question may evidently have different answers on different systems, and in this sense it is ill-posed from the start. For example, an i486 has no pipeline and a Pentium has no SSE.
The correct question to ask would be: "what is the fastest way to convert a single hex char to dec on system X, e.g. i686".
Among the approaches here, the answer to this is actually the same, or very nearly the same, on any system with a multi-stage pipeline. Any system without a pipeline will bend towards the lookup table method (LUT), but if memory access is slow the conditional method (CEV) or the bitwise evaluation method (BEV) may profit, depending on the speed of xor vs load for the given CPU.
(CEV) decomposes into 2 load-effective-address instructions, a comparison, and a conditional move between registers, which is not prone to misprediction. All these instructions are pairable in the Pentium pipeline, so they actually go in 1 cycle.
8d 57 d0 lea -0x30(%rdi),%edx
83 ff 39 cmp $0x39,%edi
8d 47 a9 lea -0x57(%rdi),%eax
0f 4e c2 cmovle %edx,%eax
The (LUT) decomposes into a mov between registers and a mov from a data-dependent memory location, plus some nops for alignment, and should take a minimum of 1 cycle. As before, there are only data dependencies.
48 63 ff movslq %edi,%rdi
8b 04 bd 00 1d 40 00 mov 0x401d00(,%rdi,4),%eax
The (BEV) is a different beast, as it actually requires 2 movs, 2 xors, and an `and`, plus a conditional mov. These can also be nicely pipelined.
89 fa mov %edi,%edx
89 f8 mov %edi,%eax
83 f2 57 xor $0x57,%edx
83 f0 30 xor $0x30,%eax
83 e7 40 and $0x40,%edi
0f 45 c2 cmovne %edx,%eax
Of course, it is a very rare occasion that converting just a single char is application-critical (maybe the Mars Pathfinder is a candidate). Instead one would expect to convert a larger string, by writing a loop and calling that function.
Thus in such a scenario the code that is better vectorizable is the winner. The LUT does not vectorize, while BEV and CEV behave better. In general such micro-optimization does not get you anywhere; write your code and let live (i.e. let the compiler do its job).
So I have actually built some tests in this sense that are easily reproducible on any system with a C++11 compiler and a random-device source, such as any *nix system. With -O2, where vectorization is not performed, CEV and LUT are almost equal, but once -O3 is set, the advantage of writing more decomposable code shows the difference.
To summarise: if you have an old compiler use LUT, if your system is low-end or old consider BEV, otherwise the compiler will outsmart you and you should use CEV.
Problem: the question is to convert from the set of chars {0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f} to the set {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}. Capital letters are not under consideration.
The idea is to take advantage of the piecewise linearity of the ASCII table.
[Simple and easy]: Conditional evaluation -> CEV
int decfromhex(int const x)
{
    return x < 58 ? x - 48 : x - 87;
}
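Used per character, CEV extends to whole strings in the obvious way (a sketch; the helper name is mine and input is assumed to be valid lowercase hex):

```cpp
#include <string>

// Combine per-char conversions into a full value: shift in 4 bits per digit.
unsigned hexstr_to_uint(const std::string& s) {
    unsigned v = 0;
    for (char c : s)
        v = (v << 4) | unsigned(c < 58 ? c - 48 : c - 87);
    return v;
}
```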
[Dirty and complex]: Bitwise evaluation -> BEV
int decfromhex(int const x)
{
    // low nibble carries the digit value; bit 0x40 is set only for letters
    return ( x & 0xf ) + 9*( (x & 0x40) >> 6 );
}
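The bitwise trick relies on ASCII layout: bit 0x40 is set for letters ('a' is 0x61, 'A' is 0x41) but clear for digits ('0'..'9' are 0x30..0x39), and the low nibble already carries the digit value. A standalone exhaustive sanity check over all 22 valid hex characters (the check harness is my addition):

```cpp
#include <cassert>

// Bitwise evaluation: low nibble, plus 9 if the letter bit (0x40) is set.
int decfromhex_bev(int const x) {
    return (x & 0xf) + 9 * ((x & 0x40) >> 6);
}

// Check every valid hex char, both cases, against the obvious conversion.
void check_bev() {
    for (char c = '0'; c <= '9'; ++c) assert(decfromhex_bev(c) == c - '0');
    for (char c = 'a'; c <= 'f'; ++c) assert(decfromhex_bev(c) == c - 'a' + 10);
    for (char c = 'A'; c <= 'F'; ++c) assert(decfromhex_bev(c) == c - 'A' + 10);
}
```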
[compile time]: Template conditional evaluation -> TCV
template<char n> int decfromhex()
{
    int constexpr x = n;
    return x < 58 ? x - 48 : x - 87;
}
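With constexpr (C++11), the template version's result can also be consumed at compile time, e.g. in a static_assert (a sketch; the constexpr qualifier and the name decfromhex_ct are my additions):

```cpp
template<char n> constexpr int decfromhex_ct() {
    return n < 58 ? n - 48 : n - 87;
}

// Evaluated entirely at compile time:
static_assert(decfromhex_ct<'7'>() == 7,  "digit");
static_assert(decfromhex_ct<'a'>() == 10, "letter");
```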
[Lookup table]: Lookup table -> LUT
int decfromhex(char n)
{
    static int constexpr x[256] = {
        // fill everything with invalid, e.g. -1, except places
        // 48-57 and 97-102 where you place 0..15
    };
    return x[(unsigned char)n];
}
Among all of them, the last seems the fastest at first look. The template version works only at compile time, on constant expressions.
[RESULT] (Please verify): BEV is the fastest among all and handles lower- and upper-case letters, but is only marginally ahead of CEV, which does not handle capital letters. LUT becomes slower than both CEV and BEV as the size of the string increases.
Exemplary results for string sizes 16-12384 can be found below (lower is better). The average time over 100 runs is shown; the size of the bubble is the standard error.
The script for running the tests is available.
Tests have been performed for the conditional CEV, the bitwise BEV and the lookup table LUT on a set of randomly generated strings. The tests are fairly simple and verifiable from:
Test source code
g++ -std=c++11 -O3 -march=native dectohex.cpp -o d2h
taskset -c 0 ./d2h
As a side note, I have seen version 3 (TCV) be much faster in practice with older C++98 compilers.
[BOTTOM LINE]: use CEV without fear, unless you know your variables at compile time, in which case you could use TCV. The LUT should only be used after significant per-use-case performance evaluation, and probably with older compilers. Another case is when your set is larger, i.e. {0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,A,B,C,D,E,F}; this can be handled as well. Finally, if you are performance hungry, use BEV.
The results with the unordered_map have been removed since they were simply too slow to compare, or at best about as fast as the LUT solution.
Results from my personal PC on strings of sizes 256 and 12384, for 100 strings:
g++ -DS=2 -DSTR_SIZE=256 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -2709
-------------------------------------------------------------------
(CEV) Total: 185568 nanoseconds - mean: 323.98 nanoseconds error: 88.2699 nanoseconds
(BEV) Total: 185568 nanoseconds - mean: 337.68 nanoseconds error: 113.784 nanoseconds
(LUT) Total: 229612 nanoseconds - mean: 667.89 nanoseconds error: 441.824 nanoseconds
-------------------------------------------------------------------
g++ -DS=2 -DSTR_SIZE=12384 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native hextodec.cpp -o d2h && taskset -c 0 ./d2h
-------------------------------------------------------------------
(CEV) Total: 5539902 nanoseconds - mean: 6229.1 nanoseconds error: 1052.45 nanoseconds
(BEV) Total: 5539902 nanoseconds - mean: 5911.64 nanoseconds error: 1547.27 nanoseconds
(LUT) Total: 6346209 nanoseconds - mean: 14384.6 nanoseconds error: 1795.71 nanoseconds
-------------------------------------------------------------------
Precision: 1 ns
The results from a system with GCC 4.9.3, compiled close to the metal, with the system otherwise unloaded, on strings of sizes 256 and 12384, for 100 strings:
g++ -DS=2 -DSTR_SIZE=256 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -2882
-------------------------------------------------------------------
(CEV) Total: 237449 nanoseconds - mean: 444.17 nanoseconds error: 117.337 nanoseconds
(BEV) Total: 237449 nanoseconds - mean: 413.59 nanoseconds error: 109.973 nanoseconds
(LUT) Total: 262469 nanoseconds - mean: 731.61 nanoseconds error: 11.7507 nanoseconds
-------------------------------------------------------------------
Precision: 1 ns
g++ -DS=2 -DSTR_SIZE=12384 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -137532
-------------------------------------------------------------------
(CEV) Total: 6834796 nanoseconds - mean: 9138.93 nanoseconds error: 144.134 nanoseconds
(BEV) Total: 6834796 nanoseconds - mean: 8588.37 nanoseconds error: 4479.47 nanoseconds
(LUT) Total: 8395700 nanoseconds - mean: 24171.1 nanoseconds error: 1600.46 nanoseconds
-------------------------------------------------------------------
Precision: 1 ns
[HOW TO READ THE RESULTS]
The mean is the time required to compute one string of the given size, in the units selected (here nanoseconds).
The total time for each test is also given. The mean is computed as the sum of per-string timings divided by their count (no other code runs in that region, though it could be vectorized, and that's ok). The error is the standard deviation of the timings.
The mean tells us what to expect on average, and the error how closely the timings follow a normal distribution. This is a fair measure of error only when it is small (otherwise something appropriate for positive distributions should be used). One should usually expect high errors in case of cache misses, processor scheduling, and many other factors.
The code uses a single macro to run the tests, lets you define compile-time variables to set up the tests, and prints complete information, such as:
g++ -DS=2 -DSTR_SIZE=64 -DSET_SIZE=1000 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -6935
-------------------------------------------------------------------
(CEV) Total: 947378 nanoseconds - mean: 300.871 nanoseconds error: 442.644 nanoseconds
(BEV) Total: 947378 nanoseconds - mean: 277.866 nanoseconds error: 43.7235 nanoseconds
(LUT) Total: 1040307 nanoseconds - mean: 375.877 nanoseconds error: 14.5706 nanoseconds
-------------------------------------------------------------------
For example, to run the test with a 2-second pause, on strings of size 256, for a total of 10000 different strings, with timings in double precision and counted in nanoseconds, the following command compiles and runs the test.
g++ -DS=2 -DSTR_SIZE=256 -DSET_SIZE=10000 -DUTYPE=double -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h