I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places. If it's a speed gain, multiple <code>atoi</code> results can be done in parallel. The strings are originally coming from a buffer of JSON data, so a multi-atoi function will have to do any required swizzling. The algorithm I came up with is the following: <ol> <li>I can initialize a vector of length N in the following fashion: [10^N..10^1]</li> <li>I convert each character in the buffer to an integer and place them in another vector.</li> <li>I take each number in the significant digits vector and multiply it by the matching number in the numbers vector and sum the results. </li> </ol> I'm targeting x86 and x86-64 architectures. I know that AVX2 supports three operand Fused Multiply-Add so I'll be able to perform Sum = Number * Significant Digit + Sum. That's where I got so far. Is my algorithm correct? Is there a better way? Is there a reference implementation for atoi using any SIMD instructions set?

I would approach this problem like this: <ol> <li>Initialize the accumulator to 0.</li> <li>Load the next four characters of the string into an SSE register.</li> <li>Subtract the value <code>'0'</code> from each character.</li> <li>Find the first value in the vector whose unsigned value is greater than <code>9</code>.</li> <li>If a value was found, shift the components of the vector to the right so that the value found in the previous step was just shifted out.</li> <li>Load a vector containing powers of ten (<code>1000</code>, <code>100</code>, <code>10</code>, <code>1</code>) and multiply with that.</li> <li>Compute the sum of all entries in the vector.</li> <li>Multiply the accumulator with an appropriate value (depending on the number of shifts in step 5) and add the vector. You can use an FMA instruction for that, but I don't know if such an instruction exists for integers.</li> <li>If no value greater than <code>9</code> was found in step four, goto step 2.</li> <li>Return the accumulator.</li> </ol> You could simplify the algorithm by zeroing out all entries beginning with the wrong one in step 5 instead of shifting and then dividing by an appropriate power of ten in the end. Please keep in mind that this algorithm reads past the end of the string and is thus not a drop-in replacement for <code>atoi</code>.

How to implement atoi using SIMD?

Tags:

c++

x86

simd

atoi

sse

I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places.

If it's a speed gain, multiple atoi results can be done in parallel. The strings are originally coming from a buffer of JSON data, so a multi-atoi function will have to do any required swizzling.

The algorithm I came up with is the following:

I can initialize a vector of length N in the following fashion: [10^N..10^1]
I convert each character in the buffer to an integer and place them in another vector.
I take each number in the significant digits vector and multiply it by the matching number in the numbers vector and sum the results.

I'm targeting x86 and x86-64 architectures.

I know that AVX2 supports three operand Fused Multiply-Add so I'll be able to perform Sum = Number * Significant Digit + Sum.
That's where I got so far.
Is my algorithm correct? Is there a better way?
Is there a reference implementation for atoi using any SIMD instructions set?

370

asked Feb 01 '16 09:02

the_drow

2 Answers

The algorithm and its implementation is finished now. It's complete and (moderately) tested (Updated for less constant memory usage and tolerating plus-char).

The properties of this code are as follows:

Works for int and uint, from MIN_INT=-2147483648 to MAX_INT=2147483647 and from MIN_UINT=0 to MAX_UINT=4294967295
A leading '-' char indicates a negative number (as sensible), a leading '+' char is ignored
Leading zeros (with or without sign char) are ignored
Overflow is ignored - bigger numbers just wraparound
Zero length strings result in value 0 = -0
Invalid characters are recognized and the conversion ends at the first invalid char
At least 16 bytes after the last leading zero must be accessible and possible security implications of reading after EOS are left to the caller
Only SSE4.2 is needed

About this implementation:

This code sample can be run with GNU Assembler(as) using .intel_syntax noprefix at the beginning
Data footprint for constants is 64 bytes (4*128 bit XMM) equalling one cache line.
Code footprint is 46 instructions with 51 micro-Ops and 64 cycles latency
One loop for removal of leading zeros, otherwise no jumps except for error handling, so...
Time complexity is O(1)

The approach of the algorithm:

- Pointer to number string is expected in ESI - Check if first char is '-', then indicate if negative number in EDX (**A**) - Check for leading zeros and EOS (**B**) - Check string for valid digits and get strlen() of valid chars (**C**) - Reverse string so that power of    10^0 is always at BYTE 15   10^1 is always at BYTE 14   10^2 is always at BYTE 13   10^3 is always at BYTE 12   10^4 is always at BYTE 11    ...    and mask out all following chars (**D**) - Subtract saturated '0' from each of the 16 possible chars (**1**) - Take 16 consecutive byte-values and and split them to WORDs    in two XMM-registers (**2**)   P O N M L K J I  | H G F E D C B A ->     H   G   F   E  |   D   C   B   A (XMM0)     P   O   N   M  |   L   K   J   I (XMM1) - Multiply each WORD by its place-value modulo 10000 (1,10,100,1000)   (factors smaller then MAX_WORD, 4 factors per QWORD/halfXMM)   (**2**) so we can horizontally combine twice before another multiply.   The PMADDWD instruction can do this and the next step: - Horizontally add adjacent WORDs to DWORDs (**3**)   H*1000+G*100  F*10+E*1  |  D*1000+C*100  B*10+A*1 (XMM0)   P*1000+O*100  N*10+M*1  |  L*1000+K*100  J*10+I*1 (XMM1) - Horizontally add adjacent DWORDs from XMM0 and XMM1 to XMM0 (**4**)   xmmDst[31-0]   = xmm0[63-32]  + xmm0[31-0]   xmmDst[63-32]  = xmm0[127-96] + xmm0[95-64]   xmmDst[95-64]  = xmm1[63-32]  + xmm1[31-0]   xmmDst[127-96] = xmm1[127-96] + xmm1[95-64] - Values in XMM0 are multiplied with the factors (**5**)   P*1000+O*100+N*10+M*1 (DWORD factor 1000000000000 = too big for DWORD, but possibly useful for QWORD number strings)   L*1000+K*100+J*10+I*1 (DWORD factor 100000000)   H*1000+G*100+F*10+E*1 (DWORD factor 10000)   D*1000+C*100+B*10+A*1 (DWORD factor 1) - The last step is adding these four DWORDs together with 2*PHADDD emulated by 2*(PSHUFD+PADDD)   - xmm0[31-0]  = xmm0[63-32]  + xmm0[31-0]   (**6**)     xmm0[63-32] = xmm0[127-96] + xmm0[95-64]       (the upper QWORD contains the same and is ignored)   - xmm0[31-0]  = xmm0[63-32]  + xmm0[31-0]   (**7**) - If the number is negative (indicated in EDX by 000...0=pos or 111...1=neg), negate it(**8**)

And the sample implementation in GNU Assembler with intel syntax:

.intel_syntax noprefix .data   .align 64     ddqDigitRange: .byte  '0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0     ddqShuffleMask:.byte  15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0      ddqFactor1:    .word  1,10,100,1000, 1,10,100,1000       ddqFactor2:    .long  1,10000,100000000,0 .text     _start:    mov   esi, lpInputNumberString    /* (**A**) indicate negative number in EDX */    mov   eax, -1    xor   ecx, ecx    xor   edx, edx    mov   bl,  byte ptr [esi]    cmp   bl,  '-'    cmove edx, eax    cmp   bl,  '+'    cmove ecx, eax    sub   esi, edx    sub   esi, ecx    /* (**B**)remove leading zeros */    xor   eax,eax               /* return value ZERO */   remove_leading_zeros:    inc   esi    cmp   byte ptr [esi-1], '0'  /* skip leading zeros */   je remove_leading_zeros    cmp   byte ptr [esi-1], 0    /* catch empty string/number */   je FINISH    dec   esi    /* check for valid digit-chars and invert from front to back */    pxor      xmm2, xmm2             movdqa    xmm0, xmmword ptr [ddqDigitRange]    movdqu    xmm1, xmmword ptr [esi]    pcmpistri xmm0, xmm1, 0b00010100 /* (**C**) iim8=Unsigned bytes, Ranges, Negative Polarity(-), returns strlen() in ECX */   jo FINISH             /* if first char is invalid return 0 - prevent processing empty string - 0 is still in EAX */    mov al , '0'         /* value to subtract from chars */    sub ecx, 16          /* len-16=negative to zero for shuffle mask */    movd      xmm0, ecx    pshufb    xmm0, xmm2 /* broadcast CL to all 16 BYTEs */    paddb     xmm0, xmmword ptr [ddqShuffleMask] /* Generate permute mask for PSHUFB - all bytes < 0 have highest bit set means place gets zeroed */    pshufb    xmm1, xmm0 /* (**D**) permute - now from highest to lowest BYTE are factors 10^0, 10^1, 10^2, ... */    movd      xmm0, eax                         /* AL='0' from above */    pshufb    xmm0, xmm2                        /* broadcast AL to XMM0 */    psubusb   xmm1, xmm0                        /* (**1**) */    movdqa    xmm0, xmm1    punpcklbw xmm0, xmm2                        /* (**2**) */    punpckhbw xmm1, xmm2    pmaddwd   xmm0, xmmword ptr [ddqFactor1]    /* (**3**) */    pmaddwd   xmm1, xmmword ptr [ddqFactor1]    phaddd    xmm0, xmm1                        /* (**4**) */    pmulld    xmm0, xmmword ptr [ddqFactor2]    /* (**5**) */    pshufd    xmm1, xmm0, 0b11101110            /* (**6**) */    paddd     xmm0, xmm1    pshufd    xmm1, xmm0, 0b01010101            /* (**7**) */    paddd     xmm0, xmm1    movd      eax, xmm0    /* negate if negative number */                  add       eax, edx                          /* (**8**) */    xor       eax, edx   FINISH:    /* EAX is return (u)int value */

The result of Intel-IACA Throughput Analysis for Haswell 32-bit:

Throughput Analysis Report -------------------------- Block Throughput: 16.10 Cycles       Throughput Bottleneck: InterIteration  Port Binding In Cycles Per Iteration: --------------------------------------------------------------------------------------- |  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   | --------------------------------------------------------------------------------------- | Cycles | 9.5    0.0  | 10.0 | 4.5    4.5  | 4.5    4.5  | 0.0  | 11.1 | 11.4 | 0.0  | ---------------------------------------------------------------------------------------  N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis  | Num Of |                    Ports pressure in cycles                     |    | |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    | --------------------------------------------------------------------------------- |   0*   |           |     |           |           |     |     |     |     |    | xor eax, eax |   0*   |           |     |           |           |     |     |     |     |    | xor ecx, ecx |   0*   |           |     |           |           |     |     |     |     |    | xor edx, edx |   1    |           | 0.1 |           |           |     |     | 0.9 |     |    | dec eax |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov bl, byte ptr [esi] |   1    |           |     |           |           |     |     | 1.0 |     | CP | cmp bl, 0x2d |   2    | 0.1       | 0.2 |           |           |     |     | 1.8 |     | CP | cmovz edx, eax |   1    | 0.1       | 0.5 |           |           |     |     | 0.4 |     | CP | cmp bl, 0x2b |   2    | 0.5       | 0.2 |           |           |     |     | 1.2 |     | CP | cmovz ecx, eax |   1    | 0.2       | 0.5 |           |           |     |     | 0.2 |     | CP | sub esi, edx |   1    | 0.2       | 0.5 |           |           |     |     | 0.3 |     | CP | sub esi, ecx |   0*   |           |     |           |           |     |     |     |     |    | xor eax, eax |   1    | 0.3       | 0.1 |           |           |     |     | 0.6 |     | CP | inc esi |   2^   | 0.3       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.6 |     |    | cmp byte ptr [esi-0x1], 0x30 |   0F   |           |     |           |           |     |     |     |     |    | jz 0xfffffffb |   2^   | 0.6       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.4 |     |    | cmp byte ptr [esi-0x1], 0x0 |   0F   |           |     |           |           |     |     |     |     |    | jz 0x8b |   1    | 0.1       | 0.9 |           |           |     |     |     |     | CP | dec esi |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | movdqa xmm0, xmmword ptr [0x80492f0] |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | movdqu xmm1, xmmword ptr [esi] |   0*   |           |     |           |           |     |     |     |     |    | pxor xmm2, xmm2 |   3    | 2.0       | 1.0 |           |           |     |     |     |     | CP | pcmpistri xmm0, xmm1, 0x14 |   1    |           |     |           |           |     |     | 1.0 |     |    | jo 0x6e |   1    |           | 0.4 |           |           |     | 0.1 | 0.5 |     |    | mov al, 0x30 |   1    | 0.1       | 0.5 |           |           |     | 0.1 | 0.3 |     | CP | sub ecx, 0x10 |   1    |           |     |           |           |     | 1.0 |     |     | CP | movd xmm0, ecx |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufb xmm0, xmm2 |   2^   |           | 1.0 | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | paddb xmm0, xmmword ptr [0x80492c0] |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufb xmm1, xmm0 |   1    |           |     |           |           |     | 1.0 |     |     |    | movd xmm0, eax |   1    |           |     |           |           |     | 1.0 |     |     |    | pshufb xmm0, xmm2 |   1    |           | 1.0 |           |           |     |     |     |     | CP | psubusb xmm1, xmm0 |   0*   |           |     |           |           |     |     |     |     | CP | movdqa xmm0, xmm1 |   1    |           |     |           |           |     | 1.0 |     |     | CP | punpcklbw xmm0, xmm2 |   1    |           |     |           |           |     | 1.0 |     |     |    | punpckhbw xmm1, xmm2 |   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | pmaddwd xmm0, xmmword ptr [0x80492d0] |   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | pmaddwd xmm1, xmmword ptr [0x80492d0] |   3    |           | 1.0 |           |           |     | 2.0 |     |     | CP | phaddd xmm0, xmm1 |   3^   | 2.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | pmulld xmm0, xmmword ptr [0x80492e0] |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm0, 0xee |   1    |           | 1.0 |           |           |     |     |     |     | CP | paddd xmm0, xmm1 |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm0, 0x55 |   1    |           | 1.0 |           |           |     |     |     |     | CP | paddd xmm0, xmm1 |   1    | 1.0       |     |           |           |     |     |     |     | CP | movd eax, xmm0 |   1    |           |     |           |           |     |     | 1.0 |     | CP | add eax, edx |   1    |           |     |           |           |     |     | 1.0 |     | CP | xor eax, edx Total Num Of Uops: 51

The result of Intel-IACA Latency Analysis for Haswell 32-bit:

Latency Analysis Report --------------------------- Latency: 64 Cycles  N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0) D - Data fetch pipe (on ports 2 and 3), CP - on a critical path F - Macro Fusion with the previous instruction occurred * - instruction micro-ops not bound to a port ^ - Micro Fusion happened # - ESP Tracking sync uop was issued @ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected ! - instruction not supported, was not accounted in Analysis  The Resource delay is counted since all the sources of the instructions are ready and until the needed resource becomes available  | Inst |                 Resource Delay In Cycles                  |    | | Num  | 0  - DV | 1  | 2  - D  | 3  - D  | 4  | 5  | 6  | 7  | FE |    | ------------------------------------------------------------------------- |  0   |         |    |         |         |    |    |    |    |    |    | xor eax, eax |  1   |         |    |         |         |    |    |    |    |    |    | xor ecx, ecx |  2   |         |    |         |         |    |    |    |    |    |    | xor edx, edx |  3   |         |    |         |         |    |    |    |    |    |    | dec eax |  4   |         |    |         |         |    |    |    |    | 1  | CP | mov bl, byte ptr [esi] |  5   |         |    |         |         |    |    |    |    |    | CP | cmp bl, 0x2d |  6   |         |    |         |         |    |    |    |    |    | CP | cmovz edx, eax |  7   |         |    |         |         |    |    |    |    |    | CP | cmp bl, 0x2b |  8   |         |    |         |         |    |    | 1  |    |    | CP | cmovz ecx, eax |  9   |         |    |         |         |    |    |    |    |    | CP | sub esi, edx | 10   |         |    |         |         |    |    |    |    |    | CP | sub esi, ecx | 11   |         |    |         |         |    |    |    |    | 3  |    | xor eax, eax | 12   |         |    |         |         |    |    |    |    |    | CP | inc esi | 13   |         |    |         |         |    |    |    |    |    |    | cmp byte ptr [esi-0x1], 0x30 | 14   |         |    |         |         |    |    |    |    |    |    | jz 0xfffffffb | 15   |         |    |         |         |    |    |    |    |    |    | cmp byte ptr [esi-0x1], 0x0 | 16   |         |    |         |         |    |    |    |    |    |    | jz 0x8b | 17   |         |    |         |         |    |    |    |    |    | CP | dec esi | 18   |         |    |         |         |    |    |    |    | 4  |    | movdqa xmm0, xmmword ptr [0x80492f0] | 19   |         |    |         |         |    |    |    |    |    | CP | movdqu xmm1, xmmword ptr [esi] | 20   |         |    |         |         |    |    |    |    | 5  |    | pxor xmm2, xmm2 | 21   |         |    |         |         |    |    |    |    |    | CP | pcmpistri xmm0, xmm1, 0x14 | 22   |         |    |         |         |    |    |    |    |    |    | jo 0x6e | 23   |         |    |         |         |    |    |    |    | 6  |    | mov al, 0x30 | 24   |         |    |         |         |    |    |    |    |    | CP | sub ecx, 0x10 | 25   |         |    |         |         |    |    |    |    |    | CP | movd xmm0, ecx | 26   |         |    |         |         |    |    |    |    |    | CP | pshufb xmm0, xmm2 | 27   |         |    |         |         |    |    |    |    | 7  | CP | paddb xmm0, xmmword ptr [0x80492c0] | 28   |         |    |         |         |    |    |    |    |    | CP | pshufb xmm1, xmm0 | 29   |         |    |         |         |    | 1  |    |    |    |    | movd xmm0, eax | 30   |         |    |         |         |    | 1  |    |    |    |    | pshufb xmm0, xmm2 | 31   |         |    |         |         |    |    |    |    |    | CP | psubusb xmm1, xmm0 | 32   |         |    |         |         |    |    |    |    |    | CP | movdqa xmm0, xmm1 | 33   |         |    |         |         |    |    |    |    |    | CP | punpcklbw xmm0, xmm2 | 34   |         |    |         |         |    |    |    |    |    |    | punpckhbw xmm1, xmm2 | 35   |         |    |         |         |    |    |    |    | 9  | CP | pmaddwd xmm0, xmmword ptr [0x80492d0] | 36   |         |    |         |         |    |    |    |    | 9  |    | pmaddwd xmm1, xmmword ptr [0x80492d0] | 37   |         |    |         |         |    |    |    |    |    | CP | phaddd xmm0, xmm1 | 38   |         |    |         |         |    |    |    |    | 10 | CP | pmulld xmm0, xmmword ptr [0x80492e0] | 39   |         |    |         |         |    |    |    |    |    | CP | pshufd xmm1, xmm0, 0xee | 40   |         |    |         |         |    |    |    |    |    | CP | paddd xmm0, xmm1 | 41   |         |    |         |         |    |    |    |    |    | CP | pshufd xmm1, xmm0, 0x55 | 42   |         |    |         |         |    |    |    |    |    | CP | paddd xmm0, xmm1 | 43   |         |    |         |         |    |    |    |    |    | CP | movd eax, xmm0 | 44   |         |    |         |         |    |    |    |    |    | CP | add eax, edx | 45   |         |    |         |         |    |    |    |    |    | CP | xor eax, edx  Resource Conflict on Critical Paths:  ----------------------------------------------------------------- |  Port  | 0  - DV | 1  | 2  - D  | 3  - D  | 4  | 5  | 6  | 7  | ----------------------------------------------------------------- | Cycles | 0    0  | 0  | 0    0  | 0    0  | 0  | 0  | 1  | 0  | -----------------------------------------------------------------  List Of Delays On Critical Paths ------------------------------- 6 --> 8 1 Cycles Delay On Port6

An alternative handling suggested in comments by Peter Cordes is replacing the last two add+xor instructions by an imul. This concentration of OpCodes is likely to be superior. Unfortunately IACA doesn't support that instruction and throws a ! - instruction not supported, was not accounted in Analysis comment. Nevertheless, although I like the reduction of OpCodes and reduction from (2uops, 2c latency) to (1 uop, 3c latency - "worse latency, but still one m-op on AMD"), I prefer to leave it to the implementer which way to choose. I haven't checked if the following code is sufficient for parsing any number. It is just mentioned for completeness and code modifications in other parts may be necessary (especially handling positive numbers).

The alternative may be replacing the last two lines with:

  ...   /* negate if negative number */                  imul eax, edx   FINISH:   /* EAX is return (u)int value */

126

answered Sep 20 '22 14:09

zx485

I would approach this problem like this:

Initialize the accumulator to 0.
Load the next four characters of the string into an SSE register.
Subtract the value '0' from each character.
Find the first value in the vector whose unsigned value is greater than 9.
If a value was found, shift the components of the vector to the right so that the value found in the previous step was just shifted out.
Load a vector containing powers of ten (1000, 100, 10, 1) and multiply with that.
Compute the sum of all entries in the vector.
Multiply the accumulator with an appropriate value (depending on the number of shifts in step 5) and add the vector. You can use an FMA instruction for that, but I don't know if such an instruction exists for integers.
If no value greater than 9 was found in step four, goto step 2.
Return the accumulator.

You could simplify the algorithm by zeroing out all entries beginning with the wrong one in step 5 instead of shifting and then dividing by an appropriate power of ten in the end.

Please keep in mind that this algorithm reads past the end of the string and is thus not a drop-in replacement for atoi.

answered Sep 18 '22 14:09

fuz

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to implement atoi using SIMD?

Tags:

c++

x86

simd

atoi

sse

the_drow

People also ask

2 Answers

zx485

fuz

Recent Activity

Donate For Us

How to implement atoi using SIMD?

Tags:

c++

x86

simd

atoi

sse

the_drow

People also ask

2 Answers

zx485

fuz

Related questions

Recent Activity

Donate For Us