How can I prevent functions from being aligned to a 16-byte boundary when compiling for x86?

I'm working in an embedded-like environment where each byte is extremely precious, much more so than additional cycles for unaligned accesses. I have some simple Rust code from an OS development example:

#![feature(lang_items)]
#![no_std]
extern crate rlibc;
#[no_mangle]
pub extern fn rust_main() {

    // ATTENTION: we have a very small stack and no guard page

    let hello = b"Hello World!";
    let color_byte = 0x1f; // white foreground, blue background

    let mut hello_colored = [color_byte; 24];
    for (i, char_byte) in hello.into_iter().enumerate() {
        hello_colored[i*2] = *char_byte;
    }

    // write `Hello World!` to the center of the VGA text buffer
    let buffer_ptr = (0xb8000 + 1988) as *mut _;
    unsafe { *buffer_ptr = hello_colored };

    loop{}

}

#[lang = "eh_personality"] extern fn eh_personality() {}
#[lang = "panic_fmt"] #[no_mangle] pub extern fn panic_fmt() -> ! {loop{}}

I also use this linker script:

OUTPUT_FORMAT("binary")
ENTRY(rust_main)
phys = 0x0000;
SECTIONS
{
  .text phys : AT(phys) {
    code = .;
    *(.text.start);
    *(.text*)
    *(.rodata)
    . = ALIGN(4);
  }
  __text_end=.;
  .data : AT(phys + (data - code))
  {
    data = .;
    *(.data)
    . = ALIGN(4);
  }
  __data_end=.;
  .bss : AT(phys + (bss - code))
  {
    bss = .;
    *(.bss)
    . = ALIGN(4);
  }
  __binary_end = .;
}

I build it with opt-level 3 and LTO, using a compiler that targets i586 and the GNU ld linker, and I also pass -O3 on the linker command line. I tried opt-level z paired with -Os at the linker as well, but that produced a larger binary (it didn't unroll the loop). As it stands, the size seems pretty reasonable with opt-level 3.

Quite a few bytes seem to be wasted on aligning functions to some boundary. After the unrolled loop, 8 nop instructions are inserted, followed by the expected infinite loop. After that comes a second infinite loop, preceded by 7 nop instructions with a 16-bit operand-size override (i.e., xchg ax,ax rather than xchg eax,eax). Together with the trailing padding and the unreachable second jmp, this adds up to about 26 bytes wasted in a 196-byte flat binary.
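To make the byte count concrete, here is a minimal sketch that recomputes the padding from the offsets in the listing further down. The helper name padding_to is hypothetical, and attributing the last 2 bytes to the ALIGN(4) in the linker script is an assumption rather than something the listing proves.

```rust
/// Padding bytes needed to move `offset` up to the next multiple of `align`.
/// (Hypothetical helper, for illustration only.)
fn padding_to(offset: usize, align: usize) -> usize {
    (align - offset % align) % align
}

fn main() {
    // Offsets read from the disassembly listing in the question.
    let after_stores = 0xa8; // first byte after the unrolled movb sequence
    let after_first_jmp = 0xb2; // first byte after the jmp at 0xb0
    let after_second_jmp = 0xc2; // first byte after the jmp at 0xc0

    // The two loop targets (0xb0 and 0xc0) both sit on 16-byte boundaries.
    let pad_before_first_loop = padding_to(after_stores, 16);
    let pad_before_second_loop = padding_to(after_first_jmp, 16);
    // Assumption: the final filler pads the file to a 4-byte boundary.
    let pad_at_end = padding_to(after_second_jmp, 4);

    let total = pad_before_first_loop + pad_before_second_loop + pad_at_end;
    println!(
        "{} + {} + {} = {} bytes of padding",
        pad_before_first_loop, pad_before_second_loop, pad_at_end, total
    );
}
```

This gives 8 + 14 + 2 = 24 bytes of pure padding. Adding the 2-byte jmp at 0xc0, which is never reached, gives the 26 bytes mentioned above.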

What exactly is the optimizer doing here?
What options do I have to disable it?
Why is unreachable code being included in the binary?

The full assembly listing is below:

   0:   c6 05 c4 87 0b 00 48    movb   $0x48,0xb87c4
   7:   c6 05 c5 87 0b 00 1f    movb   $0x1f,0xb87c5
   e:   c6 05 c6 87 0b 00 65    movb   $0x65,0xb87c6
  15:   c6 05 c7 87 0b 00 1f    movb   $0x1f,0xb87c7
  1c:   c6 05 c8 87 0b 00 6c    movb   $0x6c,0xb87c8
  23:   c6 05 c9 87 0b 00 1f    movb   $0x1f,0xb87c9
  2a:   c6 05 ca 87 0b 00 6c    movb   $0x6c,0xb87ca
  31:   c6 05 cb 87 0b 00 1f    movb   $0x1f,0xb87cb
  38:   c6 05 cc 87 0b 00 6f    movb   $0x6f,0xb87cc
  3f:   c6 05 cd 87 0b 00 1f    movb   $0x1f,0xb87cd
  46:   c6 05 ce 87 0b 00 20    movb   $0x20,0xb87ce
  4d:   c6 05 cf 87 0b 00 1f    movb   $0x1f,0xb87cf
  54:   c6 05 d0 87 0b 00 57    movb   $0x57,0xb87d0
  5b:   c6 05 d1 87 0b 00 1f    movb   $0x1f,0xb87d1
  62:   c6 05 d2 87 0b 00 6f    movb   $0x6f,0xb87d2
  69:   c6 05 d3 87 0b 00 1f    movb   $0x1f,0xb87d3
  70:   c6 05 d4 87 0b 00 72    movb   $0x72,0xb87d4
  77:   c6 05 d5 87 0b 00 1f    movb   $0x1f,0xb87d5
  7e:   c6 05 d6 87 0b 00 6c    movb   $0x6c,0xb87d6
  85:   c6 05 d7 87 0b 00 1f    movb   $0x1f,0xb87d7
  8c:   c6 05 d8 87 0b 00 64    movb   $0x64,0xb87d8
  93:   c6 05 d9 87 0b 00 1f    movb   $0x1f,0xb87d9
  9a:   c6 05 da 87 0b 00 21    movb   $0x21,0xb87da
  a1:   c6 05 db 87 0b 00 1f    movb   $0x1f,0xb87db
  a8:   90                      nop
  a9:   90                      nop
  aa:   90                      nop
  ab:   90                      nop
  ac:   90                      nop
  ad:   90                      nop
  ae:   90                      nop
  af:   90                      nop
  b0:   eb fe                   jmp    0xb0
  b2:   66 90                   xchg   %ax,%ax
  b4:   66 90                   xchg   %ax,%ax
  b6:   66 90                   xchg   %ax,%ax
  b8:   66 90                   xchg   %ax,%ax
  ba:   66 90                   xchg   %ax,%ax
  bc:   66 90                   xchg   %ax,%ax
  be:   66 90                   xchg   %ax,%ax
  c0:   eb fe                   jmp    0xc0
  c2:   66 90                   xchg   %ax,%ax

asked Jul 17 '17 04:07

Earlz

1 Answer

As Ross states, aligning functions and branch targets to 16 bytes is a common x86 optimization recommended by Intel, although it can occasionally be counterproductive, as in your case where size matters more than speed. Deciding optimally whether or not to align is a hard problem for a compiler, and I believe LLVM simply opts to always align. For more detail, see "Performance optimisations of x86-64 assembly - Alignment and branch prediction".

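As red75prime's comment hints (but doesn't explain), LLVM uses the value of the align-all-blocks option as the byte alignment for branch points, so setting it to 1 will disable alignment (for example, rustc -C llvm-args=-align-all-blocks=1). One caveat: recent LLVM sources describe this value as a log2 exponent (4 means 16-byte boundaries), in which case 1 would mean 2-byte alignment, so check how your LLVM version interprets it. Note that the option applies globally, so benchmark before and after to see what the change costs.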

answered Oct 05 '22 10:10

bug

Tags: optimization, x86, assembly, embedded, rust
