I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places.
If it's a speed gain, multiple atoi
results can be done in parallel. The strings are originally coming from a buffer of JSON data, so a multi-atoi function will have to do any required swizzling.
The algorithm I came up with is the following:
I'm targeting x86 and x86-64 architectures.
I know that AVX2 supports three operand Fused Multiply-Add so I'll be able to perform Sum = Number * Significant Digit + Sum.
That's where I got so far.
Is my algorithm correct? Is there a better way?
Is there a reference implementation for atoi using any SIMD instructions set?
The atoi() function converts a character string to an integer value. The input string is a sequence of characters that can be interpreted as a numeric value of the specified return type. The function stops reading the input string at the first character that it cannot recognize as part of a number.
Using atoi() First, atoi() converts C strings (null-terminated character arrays) to an integer, while stoi() converts the C++ string to an integer. Second, the atoi() function will silently fail if the string is not convertible to an int , while the stoi() function will simply throw an exception.
The atoi() function in C++ is defined in the cstdlib header. It accepts a string parameter that contains integer values and converts the passed string to an integer value. If the string is null or has any non-integer value, atoi in C++ silently fails the execution without throwing any error or exception.
Synopsis. The atoi() function converts a string of characters representing a numeral into a number of int . Similarly, atol() returns a long integer, and in C99, the atoll() function converts a string into an integer of type long long . The conversion ignores any leading whitespace characters (spaces, tabs, newlines).
The algorithm and its implementation is finished now. It's complete and (moderately) tested (Updated for less constant memory usage and tolerating plus-char).
The properties of this code are as follows:
int
and uint
, from MIN_INT=-2147483648
to MAX_INT=2147483647
and from MIN_UINT=0
to MAX_UINT=4294967295
'-'
char indicates a negative number (as sensible), a leading '+'
char is ignored0 = -0
About this implementation:
as
) using .intel_syntax noprefix
at the beginningThe approach of the algorithm:
- Pointer to number string is expected in ESI - Check if first char is '-', then indicate if negative number in EDX (**A**) - Check for leading zeros and EOS (**B**) - Check string for valid digits and get strlen() of valid chars (**C**) - Reverse string so that power of 10^0 is always at BYTE 15 10^1 is always at BYTE 14 10^2 is always at BYTE 13 10^3 is always at BYTE 12 10^4 is always at BYTE 11 ... and mask out all following chars (**D**) - Subtract saturated '0' from each of the 16 possible chars (**1**) - Take 16 consecutive byte-values and and split them to WORDs in two XMM-registers (**2**) P O N M L K J I | H G F E D C B A -> H G F E | D C B A (XMM0) P O N M | L K J I (XMM1) - Multiply each WORD by its place-value modulo 10000 (1,10,100,1000) (factors smaller then MAX_WORD, 4 factors per QWORD/halfXMM) (**2**) so we can horizontally combine twice before another multiply. The PMADDWD instruction can do this and the next step: - Horizontally add adjacent WORDs to DWORDs (**3**) H*1000+G*100 F*10+E*1 | D*1000+C*100 B*10+A*1 (XMM0) P*1000+O*100 N*10+M*1 | L*1000+K*100 J*10+I*1 (XMM1) - Horizontally add adjacent DWORDs from XMM0 and XMM1 to XMM0 (**4**) xmmDst[31-0] = xmm0[63-32] + xmm0[31-0] xmmDst[63-32] = xmm0[127-96] + xmm0[95-64] xmmDst[95-64] = xmm1[63-32] + xmm1[31-0] xmmDst[127-96] = xmm1[127-96] + xmm1[95-64] - Values in XMM0 are multiplied with the factors (**5**) P*1000+O*100+N*10+M*1 (DWORD factor 1000000000000 = too big for DWORD, but possibly useful for QWORD number strings) L*1000+K*100+J*10+I*1 (DWORD factor 100000000) H*1000+G*100+F*10+E*1 (DWORD factor 10000) D*1000+C*100+B*10+A*1 (DWORD factor 1) - The last step is adding these four DWORDs together with 2*PHADDD emulated by 2*(PSHUFD+PADDD) - xmm0[31-0] = xmm0[63-32] + xmm0[31-0] (**6**) xmm0[63-32] = xmm0[127-96] + xmm0[95-64] (the upper QWORD contains the same and is ignored) - xmm0[31-0] = xmm0[63-32] + xmm0[31-0] (**7**) - If the number is negative (indicated in EDX by 000...0=pos or 111...1=neg), negate it(**8**)
And the sample implementation in GNU Assembler with intel syntax:
.intel_syntax noprefix .data .align 64 ddqDigitRange: .byte '0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0 ddqShuffleMask:.byte 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 ddqFactor1: .word 1,10,100,1000, 1,10,100,1000 ddqFactor2: .long 1,10000,100000000,0 .text _start: mov esi, lpInputNumberString /* (**A**) indicate negative number in EDX */ mov eax, -1 xor ecx, ecx xor edx, edx mov bl, byte ptr [esi] cmp bl, '-' cmove edx, eax cmp bl, '+' cmove ecx, eax sub esi, edx sub esi, ecx /* (**B**)remove leading zeros */ xor eax,eax /* return value ZERO */ remove_leading_zeros: inc esi cmp byte ptr [esi-1], '0' /* skip leading zeros */ je remove_leading_zeros cmp byte ptr [esi-1], 0 /* catch empty string/number */ je FINISH dec esi /* check for valid digit-chars and invert from front to back */ pxor xmm2, xmm2 movdqa xmm0, xmmword ptr [ddqDigitRange] movdqu xmm1, xmmword ptr [esi] pcmpistri xmm0, xmm1, 0b00010100 /* (**C**) iim8=Unsigned bytes, Ranges, Negative Polarity(-), returns strlen() in ECX */ jo FINISH /* if first char is invalid return 0 - prevent processing empty string - 0 is still in EAX */ mov al , '0' /* value to subtract from chars */ sub ecx, 16 /* len-16=negative to zero for shuffle mask */ movd xmm0, ecx pshufb xmm0, xmm2 /* broadcast CL to all 16 BYTEs */ paddb xmm0, xmmword ptr [ddqShuffleMask] /* Generate permute mask for PSHUFB - all bytes < 0 have highest bit set means place gets zeroed */ pshufb xmm1, xmm0 /* (**D**) permute - now from highest to lowest BYTE are factors 10^0, 10^1, 10^2, ... */ movd xmm0, eax /* AL='0' from above */ pshufb xmm0, xmm2 /* broadcast AL to XMM0 */ psubusb xmm1, xmm0 /* (**1**) */ movdqa xmm0, xmm1 punpcklbw xmm0, xmm2 /* (**2**) */ punpckhbw xmm1, xmm2 pmaddwd xmm0, xmmword ptr [ddqFactor1] /* (**3**) */ pmaddwd xmm1, xmmword ptr [ddqFactor1] phaddd xmm0, xmm1 /* (**4**) */ pmulld xmm0, xmmword ptr [ddqFactor2] /* (**5**) */ pshufd xmm1, xmm0, 0b11101110 /* (**6**) */ paddd xmm0, xmm1 pshufd xmm1, xmm0, 0b01010101 /* (**7**) */ paddd xmm0, xmm1 movd eax, xmm0 /* negate if negative number */ add eax, edx /* (**8**) */ xor eax, edx FINISH: /* EAX is return (u)int value */
The result of Intel-IACA Throughput Analysis for Haswell 32-bit:
Throughput Analysis Report -------------------------- Block Throughput: 16.10 Cycles Throughput Bottleneck: InterIteration Port Binding In Cycles Per Iteration: --------------------------------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | --------------------------------------------------------------------------------------- | Cycles | 9.5 0.0 | 10.0 | 4.5 4.5 | 4.5 4.5 | 0.0 | 11.1 | 11.4 | 0.0 | --------------------------------------------------------------------------------------- N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | | --------------------------------------------------------------------------------- | 0* | | | | | | | | | | xor eax, eax | 0* | | | | | | | | | | xor ecx, ecx | 0* | | | | | | | | | | xor edx, edx | 1 | | 0.1 | | | | | 0.9 | | | dec eax | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | mov bl, byte ptr [esi] | 1 | | | | | | | 1.0 | | CP | cmp bl, 0x2d | 2 | 0.1 | 0.2 | | | | | 1.8 | | CP | cmovz edx, eax | 1 | 0.1 | 0.5 | | | | | 0.4 | | CP | cmp bl, 0x2b | 2 | 0.5 | 0.2 | | | | | 1.2 | | CP | cmovz ecx, eax | 1 | 0.2 | 0.5 | | | | | 0.2 | | CP | sub esi, edx | 1 | 0.2 | 0.5 | | | | | 0.3 | | CP | sub esi, ecx | 0* | | | | | | | | | | xor eax, eax | 1 | 0.3 | 0.1 | | | | | 0.6 | | CP | inc esi | 2^ | 0.3 | | 0.5 0.5 | 0.5 0.5 | | | 0.6 | | | cmp byte ptr [esi-0x1], 0x30 | 0F | | | | | | | | | | jz 0xfffffffb | 2^ | 0.6 | | 0.5 0.5 | 0.5 0.5 | | | 0.4 | | | cmp byte ptr [esi-0x1], 0x0 | 0F | | | | | | | | | | jz 0x8b | 1 | 0.1 | 0.9 | | | | | | | CP | dec esi | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | movdqa xmm0, xmmword ptr [0x80492f0] | 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | CP | movdqu xmm1, xmmword ptr [esi] | 0* | | | | | | | | | | pxor xmm2, xmm2 | 3 | 2.0 | 1.0 | | | | | | | CP | pcmpistri xmm0, xmm1, 0x14 | 1 | | | | | | | 1.0 | | | jo 0x6e | 1 | | 0.4 | | | | 0.1 | 0.5 | | | mov al, 0x30 | 1 | 0.1 | 0.5 | | | | 0.1 | 0.3 | | CP | sub ecx, 0x10 | 1 | | | | | | 1.0 | | | CP | movd xmm0, ecx | 1 | | | | | | 1.0 | | | CP | pshufb xmm0, xmm2 | 2^ | | 1.0 | 0.5 0.5 | 0.5 0.5 | | | | | CP | paddb xmm0, xmmword ptr [0x80492c0] | 1 | | | | | | 1.0 | | | CP | pshufb xmm1, xmm0 | 1 | | | | | | 1.0 | | | | movd xmm0, eax | 1 | | | | | | 1.0 | | | | pshufb xmm0, xmm2 | 1 | | 1.0 | | | | | | | CP | psubusb xmm1, xmm0 | 0* | | | | | | | | | CP | movdqa xmm0, xmm1 | 1 | | | | | | 1.0 | | | CP | punpcklbw xmm0, xmm2 | 1 | | | | | | 1.0 | | | | punpckhbw xmm1, xmm2 | 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | CP | pmaddwd xmm0, xmmword ptr [0x80492d0] | 2^ | 1.0 | | 0.5 0.5 | 0.5 0.5 | | | | | | pmaddwd xmm1, xmmword ptr [0x80492d0] | 3 | | 1.0 | | | | 2.0 | | | CP | phaddd xmm0, xmm1 | 3^ | 2.0 | | 0.5 0.5 | 0.5 0.5 | | | | | CP | pmulld xmm0, xmmword ptr [0x80492e0] | 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm0, 0xee | 1 | | 1.0 | | | | | | | CP | paddd xmm0, xmm1 | 1 | | | | | | 1.0 | | | CP | pshufd xmm1, xmm0, 0x55 | 1 | | 1.0 | | | | | | | CP | paddd xmm0, xmm1 | 1 | 1.0 | | | | | | | | CP | movd eax, xmm0 | 1 | | | | | | | 1.0 | | CP | add eax, edx | 1 | | | | | | | 1.0 | | CP | xor eax, edx Total Num Of Uops: 51
The result of Intel-IACA Latency Analysis for Haswell 32-bit:
Latency Analysis Report --------------------------- Latency: 64 Cycles N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis The Resource delay is counted since all the sources of the instructions are ready and until the needed resource becomes available | Inst | Resource Delay In Cycles | | | Num | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | FE | | ------------------------------------------------------------------------- | 0 | | | | | | | | | | | xor eax, eax | 1 | | | | | | | | | | | xor ecx, ecx | 2 | | | | | | | | | | | xor edx, edx | 3 | | | | | | | | | | | dec eax | 4 | | | | | | | | | 1 | CP | mov bl, byte ptr [esi] | 5 | | | | | | | | | | CP | cmp bl, 0x2d | 6 | | | | | | | | | | CP | cmovz edx, eax | 7 | | | | | | | | | | CP | cmp bl, 0x2b | 8 | | | | | | | 1 | | | CP | cmovz ecx, eax | 9 | | | | | | | | | | CP | sub esi, edx | 10 | | | | | | | | | | CP | sub esi, ecx | 11 | | | | | | | | | 3 | | xor eax, eax | 12 | | | | | | | | | | CP | inc esi | 13 | | | | | | | | | | | cmp byte ptr [esi-0x1], 0x30 | 14 | | | | | | | | | | | jz 0xfffffffb | 15 | | | | | | | | | | | cmp byte ptr [esi-0x1], 0x0 | 16 | | | | | | | | | | | jz 0x8b | 17 | | | | | | | | | | CP | dec esi | 18 | | | | | | | | | 4 | | movdqa xmm0, xmmword ptr [0x80492f0] | 19 | | | | | | | | | | CP | movdqu xmm1, xmmword ptr [esi] | 20 | | | | | | | | | 5 | | pxor xmm2, xmm2 | 21 | | | | | | | | | | CP | pcmpistri xmm0, xmm1, 0x14 | 22 | | | | | | | | | | | jo 0x6e | 23 | | | | | | | | | 6 | | mov al, 0x30 | 24 | | | | | | | | | | CP | sub ecx, 0x10 | 25 | | | | | | | | | | CP | movd xmm0, ecx | 26 | | | | | | | | | | CP | pshufb xmm0, xmm2 | 27 | | | | | | | | | 7 | CP | paddb xmm0, xmmword ptr [0x80492c0] | 28 | | | | | | | | | | CP | pshufb xmm1, xmm0 | 29 | | | | | | 1 | | | | | movd xmm0, eax | 30 | | | | | | 1 | | | | | pshufb xmm0, xmm2 | 31 | | | | | | | | | | CP | psubusb xmm1, xmm0 | 32 | | | | | | | | | | CP | movdqa xmm0, xmm1 | 33 | | | | | | | | | | CP | punpcklbw xmm0, xmm2 | 34 | | | | | | | | | | | punpckhbw xmm1, xmm2 | 35 | | | | | | | | | 9 | CP | pmaddwd xmm0, xmmword ptr [0x80492d0] | 36 | | | | | | | | | 9 | | pmaddwd xmm1, xmmword ptr [0x80492d0] | 37 | | | | | | | | | | CP | phaddd xmm0, xmm1 | 38 | | | | | | | | | 10 | CP | pmulld xmm0, xmmword ptr [0x80492e0] | 39 | | | | | | | | | | CP | pshufd xmm1, xmm0, 0xee | 40 | | | | | | | | | | CP | paddd xmm0, xmm1 | 41 | | | | | | | | | | CP | pshufd xmm1, xmm0, 0x55 | 42 | | | | | | | | | | CP | paddd xmm0, xmm1 | 43 | | | | | | | | | | CP | movd eax, xmm0 | 44 | | | | | | | | | | CP | add eax, edx | 45 | | | | | | | | | | CP | xor eax, edx Resource Conflict on Critical Paths: ----------------------------------------------------------------- | Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | ----------------------------------------------------------------- | Cycles | 0 0 | 0 | 0 0 | 0 0 | 0 | 0 | 1 | 0 | ----------------------------------------------------------------- List Of Delays On Critical Paths ------------------------------- 6 --> 8 1 Cycles Delay On Port6
An alternative handling suggested in comments by Peter Cordes is replacing the last two add+xor
instructions by an imul
. This concentration of OpCodes is likely to be superior. Unfortunately IACA doesn't support that instruction and throws a ! - instruction not supported, was not accounted in Analysis
comment. Nevertheless, although I like the reduction of OpCodes and reduction from (2uops, 2c latency) to (1 uop, 3c latency - "worse latency, but still one m-op on AMD"), I prefer to leave it to the implementer which way to choose. I haven't checked if the following code is sufficient for parsing any number. It is just mentioned for completeness and code modifications in other parts may be necessary (especially handling positive numbers).
The alternative may be replacing the last two lines with:
... /* negate if negative number */ imul eax, edx FINISH: /* EAX is return (u)int value */
I would approach this problem like this:
'0'
from each character.9
.1000
, 100
, 10
, 1
) and multiply with that.9
was found in step four, goto step 2.You could simplify the algorithm by zeroing out all entries beginning with the wrong one in step 5 instead of shifting and then dividing by an appropriate power of ten in the end.
Please keep in mind that this algorithm reads past the end of the string and is thus not a drop-in replacement for atoi
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With