Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing an incrementing ASCII decimal counter in video RAM on 7th gen Intel Core

I'm trying to optimize the following subroutine for a specific Kaby Lake CPU (i5-7300HQ), ideally to make the code at least 10 times faster compared to its original form. The code runs as a floppy-style bootloader in 16-bit real mode. It displays a ten digit decimal counter on screen, counting 0 - 9999999999 and then halting.

I have taken a look at Agner's Optimization Guides for Microarchitecture and Assembly, Instruction Performance Table and Intel's Optimization Reference Manual.

Only sensible optimization I've been able to do so far is swapping the loop instruction for dec + jnz, explanation here.

Another possible optimization might be swapping the lodsb for mov + dec, but the information I've found about that has been conflicting, with some saying it helps slightly and others that it might actually hurt the performance on modern CPUs.

I also tried switching to 32-bit mode and keeping the entire counter in an unused register pair to eliminate any memory access, but after reading into it a bit I realized that those ten bits will get cached immediately and the difference in latency between L1 cache and registers is only about a factor of three, so definitely not worth the added overhead of working with the counter in that format.

(editor's note: add reg latency is 1 cycle, add [mem] latency is about 6 cycles, including the 5 cycle store-forwarding latency. Or much worse if [mem] is uncacheable like video RAM.)

org 7c00h

pos equ 2*(2*80-2)  ;address on screen

;init
cli
mov ax,3
int 10h
mov ax,0b800h
mov es,ax
jmp 0:start

start:
    push cs
    pop ds
    std

    mov ah, 4Eh
    xor cx, cx
    mov bl,'9'

countloop:
    mov cl,10           ;number of digits to add to
    mov si,counter+9    ;start of counter
    mov di,pos          ;screen position

    stc                 ;set carry for first adc
next_digit:
    lodsb               ;load digit
    adc al,0
    cmp bl, al
    jnc print
    add al,-10          ;propagate carry if resulting digit > 9
print:
    mov [si+1],al       ;save new digit
    stosw               ;print

    ;replaced loop with a faster equivalent
    ;loop next_digit
    dec cl
    jnz next_digit

    jnc countloop

    jmp $

counter:
    times 10 db '0'

    times 510-($-$$) db 0
    dw 0aa55h

My question is - what can I do to achieve the desired increase in speed? What other materials can I study to gain more understanding of the underlying concepts?

Note: this is a school assignment. While a straight answer would definitely help, I'd much more appreciate explanations or pointers to relevant study material, as we have been given none.

EDIT: Changed code to a minimal reproducible example

like image 971
Dex Avatar asked Apr 27 '20 13:04

Dex


2 Answers

Here's my take on it. The following optimisations have been applied:

  • the least significant digit has been unrolled completely for best performance
  • the remaining digits have been unrolled to one section per digit
  • BCD arithmetic has been used to reduce the code to one conditional branch per digit
  • segment usage has been shuffled around to reduce the number of prefixes used
  • instruction order has been optimised to move long-latency instructions out of the critical path

Additionally I have altered the code to be a COM binary for easier testing. Turning it back into a boot loader is left as an exercise to the reader. One thing you can do once it's a boot loader is fixing the code such that CS and SS have a segment base of 0000. This avoids a penalty on loads and stores on some microarchitectures.

        org     100h

pos     equ     2*(2*80-12)             ; address on screen

        mov     ax, 3                   ; set up video mode
        int     10h
        mov     ax, 0b800h
        mov     ds, ax
        mov     es, ax

        mov     di, pos
        mov     ax, 4e30h               ; '0' + attribute byte 4e
        mov     cx, 10
        cld
        rep     stosw                   ; set up initial display

        xor     ax, ax
        sub     sp, 10
        push    ax
        push    ax
        push    ax
        push    ax
        push    ax
        mov     bp, sp                  ; set up counter

        dec     di
        dec     di                      ; di points to the last digit on screen
        mov     bx, digits              ; translation table

        jmp     countloop

%macro  docarry 1                       ; digits other than the last one
        mov     al, [bp+%1]             ; second to last digit
        inc     ax                      ; add carry to al
        aaa                             ; generate BCD carry
        mov     [bp+%1], al             ; desposit to counter
        cs xlat                         ; generate ASCII digit
        mov     [di-2*9+2*%1], al       ; display digit
        jnc     countloop               ; exit when carry dies
%endm

docarry2:                               ; place this here so jumps are in range
        docarry 2
        docarry 1
        docarry 0
        int     20h

        align   16                      ; for performance
countloop:
        mov     [di], byte '0'          ; treat last digit separately
        mov     [di], byte '1'
        mov     [di], byte '2'
        mov     [di], byte '3'
        mov     [di], byte '4'
        mov     [di], byte '5'
        mov     [di], byte '6'
        mov     [di], byte '7'
        mov     [di], byte '8'
        mov     [di], byte '9'

        docarry 8
        docarry 7
        docarry 6
        docarry 5
        docarry 4
        docarry 3
        jmp     docarry2

digits:
        db      '0123456789'

This increases the speed by a factor of about 30 compared to the original code on my 8 MHz 80286 based machine and manages to increment the counter about 329000 times per second (about 3.04 µs per digit). It's going to be a bit hard to test on a modern system, but I'll try to find a solution.

like image 108
fuz Avatar answered Sep 30 '22 19:09

fuz


If a counter ticks in the forest, does anyone see it?

our requirements state that every single change of a number has to be visible on screen

Your screen's refresh rate is probably 60Hz, maybe as high as 144Hz. Changing video RAM any faster than that will leave some counts unread by the hardware scan out loop over the framebuffer1, never sent to a physical screen, and never turning into a pattern of photons of visible light that a high-speed camera could record.

Footnote 1: Or the virtual equivalent if VGA text mode is emulated somehow on top of hardware that only knows how to draw pixels. Asked Does modern PC video hardware support VGA text mode in HW, or does the BIOS emulate it (with System Management Mode)? as a followup.

If we don't accept this limit of 1 increment per 16.66.. ms (60 Hz), we need to decide what we're willing to bottleneck on vs. what we can sidestep.

Certainly we need to do the actual work of having the ASCII digits calculated, not just incrementing a binary counter and formatting it into a string occasionally in a timer or vertical blanking interrupt (once per screen refresh). That wouldn't satisfy the spirit of the assignment.

Or what if we calculate the ASCII digits purely in registers and only do mov stores in a timer or vblank interrupt? That would sample the fast-incrementing counter asynchronously from its increments so you'd visually see all the low digits changing. (Which is a pretty clear minimum requirement).

Omitting stores from the actual loop still doesn't feel like it hits the spirit of the assignment. I think our loop should, if run on its own with no fancy hardware setup, truly get every count all the way to video RAM. That seems uncontroversial. That's what the original code does.

The CPU can be configured to do write-combining with MTRRs. Some desktops had a BIOS option to set the AGP GART as UC (UnCacheable) vs. WC (calling it "USWC = Uncacheable Speculative Write Combining"). This BIOS-tuning article has a section on it. It seems modern firmware leaves VGA memory UC, letting OSes / graphics drivers set up MTRRs / PAT.

Unfortunately, making VGA memory WC works too well and the stores never make it out of the CPU core's write-combining buffer. (An LFB since this is an Intel CPU.) We can flush manually after every store with a memory barrier like mfence, or clflushopt with the address of the cache line. But then we're back where we started because on the OP's Kaby Lake iGPU / firmware, it seems that flushing a WC store costs about the same as just doing a UC store costs.

Of course we only have to flush when the whole counter is in sync, after updating all the digits if a carry rippled far. If we were storing each digit separately this could speed us up by 11.111% if I have my math right vs. UC memory. Or if we were doing dword stores of 2 digits at once, by 1.0101% because we only need an extra store every 100 counts, not every 10.

I think we can capture the spirit of the assignment while still letting the hardware optimize our stores by using a WC framebuffer and flushing in a timer or vblank interrupt.

This means we're incrementing a counter very fast (nearly 1 count per core clock cycle with a careful implementation). And we sample that counter by merely using a memory barrier or serializing instruction in an interrupt handler that runs right before the video hardware starts a new pass at the top left of the screen, scanning out a new frame. In fact iret is serializing so merely returning from an empty interrupt handler will do the job. Holding down a key on the keyboard may even make the counter updates visible on screen (where they weren't otherwise) if you used the MTRR to make video RAM WC but didn't program a timer or vertical-blanking interrupt to fire periodically.

Using clflush or mfence from an outer level of the loop wouldn't work well; that would be synchronous with the increments and would thus leave the low digits always zero. It would make the fact that we only sometimes flush explicit in the loop, instead of leaving the flushing as something that happens because of interrupts that are part of normal system operation. (Or at least they would be if this bootloader wasn't literally the only thing running. e.g. if run under DOS you'd have a timer interrupt every few ms.)


If we insist on flushing to video RAM every count (either by leaving it UC, or manually with WC + explicit flushes in the loop), the only optimization that would matter is reducing the number of stores to video RAM. i.e. by not updating digits that aren't changing. The original code stores every digit every time, so fixing that should give very close to a 10x speedup.

Even just storing to uncacheable DRAM or making a PCIe transaction is much slower than anything you could optimize inside the loop, even a self-modifying-code machine clear. And if storing to a VGA text framebuffer triggers a System Management Mode Interrupt (SMI) to emulate text mode by updating a real pixel framebuffer, the cost of a store to the frame is astronomical compared to anything else you could do in the loop. This might well be how firmware for out Skylake / Kaby Lake integrated GPUs works: Does modern PC video hardware support VGA text mode in HW, or does the BIOS emulate it (with System Management Mode)?

Allowing the hardware to do write-combining on our stores to VRAM is thus essential to making this optimization problem interesting beyond that one algorithmic tweak.

To do this, program the MTRR for the VGA framebuffer. https://wiki.osdev.org/MTRR documents the actual MSRs you can use with the wrmsr instruction. I think each MSR has a bit-field of 8 regions. The one you want is IA32_MTRR_FIX16K_A0000, in MSR[259] - 8 regions of 16 KB each (128 KB total) which include the linear address block B8000 that holds VGA text-mode memory. Figure 11-8 in Intel's SDM vol 3 documents the layout.


Assuming WC video memory (or for updating WB cacheable memory)

There are lots of things to improve, but two critical things:

  • Micro-architectural: Self-modifying code pipeline nukes, aka machine clears, from count[] being in the same 64B cache line as your main loop (~50x performance with no other changes.) Without changing this, it's hard to see any gains from any other micro-optimizations.

  • Algorithmic: Don't blindly propagate carry all the way up through every digit every time: 90% of the increments don't carry at all, 99% carry only 1 place, etc. Nested loops to handle the low digits can run very efficiently, just incrementing their own digit counter and having the outer loop reset it to '0', no need to explicitly propagate those carries with adc. Keeping those ASCII digits in registers also avoids the need to load/store them to counts[], just pure stores to video RAM, like mov [di-4], eax.

    With very efficient inner loops for the low digits, performance of the upper 6 or 7 digits becomes nearly irrelevant. That part runs once per 10k or 1k increments so its cost is amortized. (~19x speedup for aggressively optimized inner loops vs. a micro-optimized version of your original loop that saves some uops and avoids some bottlenecks without changing the algorithm.)

Other micro-optimizations of your original (after fixing the SMC machine clears) gave a factor of ~1.5x speedup: making the carry branch normally not-taken, saving some uops, avoiding some partial-register false dependencies from lodsb and writing 16-bit partial registers.

With the optimized 4 levels of inner loops I rewrote from scratch, my version is about 29x faster on Skylake / Kaby Lake than the no-SMC-stall version of the original, or ~1500x faster than the true original. There's certainly a middle ground where you do adc carry propagation but take an early out when CF==0; I didn't try to implement that.

Tested in 32-bit mode, but the same code assembled for 16-bit mode should execute the same way, including the SMC stalls in your original. (Assuming WC stores don't trigger an SMI until flushed, and that the WC buffer keeps the stores local inside the core so ~1 store / clock is possible just like with WB memory.)

SKL and KBL are clock-for-clock identical in perf, same microarchitecture, so my test results should be reproducible for you. I did assemble your code in 16-bit mode to see alignment: it looks like your loop will have some bytes of count[] in the same 64-byte cache line as the end of the loop, hence a SMC pipeline nuke per iteration for most digits.


I adapted your original code so I could run the same loop in 32-bit mode under Linux, making it possible to use perf to profile with HW performance counters. The first step in optimizing anything is to get a baseline measurement. Since you mention some micro-optimizations for the micro-architectural reasons, we want perf counters not just total time. We can't easily get that in a bootloader on bare metal. Possibly in a guest VM, but then you'd be storing to a virtual VGA device, not real hardware, so it's probably not different from using normal or NT stores on normal WB memory in user-space under Linux.

perf stat -I1000 to show counters for amount of work done every second is a handy way to compare speed for tweaks that don't change the algorithm or number of branches. Look at counts for branches in 1 second to see relative speed of the loop, or divide that by cycles.

I used movnti to try to simulate a store to WC video RAM (uncacheable speculative Write-Combining, instead of normal WB = write-back cacheable). I think normal stores to WC memory regions behave like movnt stores. movnt stores that don't complete a cache line can keep updating the same write-combining LFB without actually flushing to memory. So it's similar to a normal store to WB memory that can hit in L1d cache.

SMI trapping of framebuffer stores (if done at all) is done by hardware outside the CPU core, probably the System Agent, so it doesn't fire until the core flushes. Or if no SMI trap, then probably it just goes to DRAM on our iGPU systems. Or over a PCIe bus to get to video RAM on a separate card.


Versions timed under GNU/Linux kernel 5.5.10 on i7-6700k on a somewhat idle system at ~4.2GHz

DRAM and cache are barely involved, and the system was idle enough that nothing was taking cycles on the other logical core of the physical core, so the code had a whole CPU to itself the whole time to spam stores into a write-combining buffer.

  • Original version, ported to run in 32-bit user-space: Godbolt - not fully timed, but perf stat -I1000 to print stats per second shows it's running about 52x slower than with align 64 before counter:. The pipeline nuke may include flushing WC buffers which would mean going to DRAM as well.
  • Original version, with SMC pipeline nukes avoided: ~85.7 seconds, ~358 billion core clock cycles for 10^10 counts. 2.66 IPC
  • Micro-optimized version of that: Godbolt - ~55.3 seconds, ~231 billion clock cycles for 10^10 counts. 4.56 IPC (but with more simpler instructions, not lodsb)
  • New inner loops, empty placeholder outer loop: Godbolt - ~2.93 seconds, ~12.25 billion core clock cycles. 2.73 IPC

The optimized version achieves close to 3 stores per 4 clocks. (Counting the low 2 digits from 00..99 takes 100 stores, the way it does it. I didn't time these final versions with clflushopt.)


If you fixed some of the stalls and stopped your loop with CF==0, this would result in a bottleneck on store/reload (store-forwaring) latency to low element of the count array. You definitely want those in registers so they can be store-only, not load/adc/store.

TODO: comment and talk about the micro-optimizations that I applied for that version:

  • Why doesn't GCC use partial registers? / How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent - also lodsb sucks. lodsd/ q are ok. Use movzx to do narrow loads, instead of merging into the low byte. Fortunately inc/dec in an adc loop on Sandybridge-family is fine, not causing partial-flag stalls like it would on P6-family. Especially in Skylake which doesn't do flag-merging at all, instead just reading the CF and/or SPAZO parts of FLAGS separately if needed. (Consequence: cmovbe and cmova are 2 uops to read 2 integer inputs and CF + ZF; other cmov are only 1 uop.)

  • You can use 32-bit registers in 16-bit mode, you don't have to switch modes. The assembler just uses an operand-size prefix. Writing a 32-bit register has no dependency on the old value, but 16 or 8 does. I used this to break dependency chains that would otherwise be loop-carried, allowing the CPU to exploit the instruction-level parallelism (ILP) across loop iterations / http://www.lighterra.com/papers/modernmicroprocessors/.

  • Haswell/Skylake have taken branch throughput of 1/clock, but can run a not-taken and a taken in the same cycle. Lay out branches to favour not-taken on the fast path (always a good idea in general).

  • Which Intel microarchitecture introduced the ADC reg,0 single-uop special case? - adc al,0 is unfortunately 2 uops on Skylake, unlike adc eax,0 or adc bl,0. Crazy, right? This is basically a CPU performance bug or CPU missed-optimization by the hardware designers, where the special-case opcodes for smaller encodings decode worse.

  • 32-byte aligned routine does not fit the uops cache - Intel's recent JCC erratum makes the idq.mite_uops perf event worth checking. Skylake used to be pretty robust against code alignment, but now it's horrible for high-throughput code.

    Perf doesn't totally fall off a cliff, but a significant factor is possible because of front-end bottlenecks from having to use legacy decode for some 32-byte blocks of machine code that end with a jcc on a 32-byte boundary. I didn't spend a lot of effort on this optimization for this code, but the fast versions happen to avoid this problem according to perf counters.

My version with nested loops, testable on GNU/Linux

This is only the inner loop; the outer loop just repeats it 10^10 / 10k times with no actual outer-loop work. We only leave the inner 4 loops once per 10k increments so pretending that part takes zero time doesn't change the result particularly.

The same pattern of 2 nested levels of looping per register could be repeated more times, or just do a chain of adc like you were doing.

;; nasm -felf32 decimal-counter.asm
;; ld -N -melf_i386 -o decimal-counter decimal-counter.o
;; writeable text segment like a bootloader
;; runs in 32-bit mode with prefixes for 16-bit operand-size
;;
;; taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,resource_stalls.any:u,rs_events.empty_cycles:u,machine_clears.count:u -I1000 ./decimal-counter

%use smartalign
alignmode p6, 64

;org 7c00h

;pos equ vram + 2*(2*80-2)  ;address on screen
pos equ vram + 2*(2*80-4)  ;address on screen

    ; In GDB, use
    ; p ((char*)&vram) + 2*(2*80-4)-36

;init
;cli
;mov ax,3
;int 10h
;mov ax,0b800h
;mov es,ax
;jmp 0:start


 ; pick your poison, or let stores stay in the CPU, not reaching VRAM
%macro FLUSH 1
 ;  clflushopt %1           ; all the way to DRAM
 ;  mfence                  ; for mov to WB: just drain store buffer.  For WC or movnt, IDK how guaranteed it is to hit DRAM
;   lock xor byte [esp], 0   ; faster version of mfence (at least on Skylake)
%endmacro
;%define movnti mov         ; for experiments

global _start
align 512
_start:
;    push cs
;    pop ds
;    mov ebp, counter+9    ; save address in a register
;    mov edi,pos
    mov edi, pos - 10*4
    mov eax, '0_0_'
    mov ecx, 10
    rep stosw                   ; memset the digits in VRAM

    mov  ebp, 10000000000 / 10000     ; outer loop iterations
    mov edi, pos-4

;    mov ah, 4Eh         ; VGA attribute byte
;    mov eax, '____'

align 32
.outer:

    mov  edx, '0_0_'           ; thousands (low), hundreds (high) digits
.thousands:
 .hundreds:
    movnti  [edi-4], edx
    ; don't want to flush yet; only after low digits are updated
    add  edx, 1<<16

    mov  eax, '0_0_'            ; tens (low=AX), ones (high) digits
    .tens:
        .ones:                  ; do{
          movnti  [edi], eax         ; store low 2 digits
        FLUSH [edi]
          lea  ecx, [eax + (1<<16)]       ; off the critical path of the EAX dep chain
          movnti  [edi], ecx
        FLUSH [edi]
          add  eax, 2<<16               ; unroll by 2
          cmp  eax, '9_'<<16
          jle  .ones            ; }while(ones<='9')
                   ; mov byte [edi+2], '9'    ; peel the last 2 iterations?

        add  eax, ('1_0_') - ('0_0_' + (10<<16))     ; increment the more-significant digit (AL), resetting less-significant digit back to '0'
        cmp  al, '9'
        jle  .tens

    cmp  edx, '9_9_'
    jle  .hundreds

    add  edx, ('1_0_') - ('0_0_' + (10<<16))     ; increment the more-significant digit (DL), resetting less-significant digit back to '0'
    cmp  dl, '9'
    jle  .thousands

;; TODO: increment the high 6 digits, propagating carry.  Possibly clflushopt here only?
;    pause
    dec ebp
    jnz .outer
    ;    jmp $
    mov eax, 1
    int 0x80


;section .data   ; avoids machine clears
    ; in original 16-bit code: counter starts at 00000037 30<rept>, ends at 00000040 (inclusive), in same cache line as the loop
align 64
counter:
    times 10 db '0'
;section .text

    times 510-($-$$) db 0
    dw 0aa55h

section .bss
vram:   resw 80*25

I have tested that this works for the low digits, single-stepping it in GDB and using display ((char*)&vram) + 2*(2*80-4)-36 or something like that to show the contents of that part of BSS as a string every step.

Using dword stores means that when the ones place, wraps we don't need a separate store to update the tens place. It just has to update the low byte of the same register and let the first iteration of the inner loop do that store.

During rollover from 0099 to 0100, memory contents are temporarily 0199. But unless you use SSE to store 16 bytes at once, you can't really avoid one problem or the other. The other option would be to somehow arrange for 0000 before 0100, but that might waste a store to the tens/ones dword in the hundreds loop.

like image 35
Peter Cordes Avatar answered Sep 30 '22 17:09

Peter Cordes