How to understand this?

It's from this question.

 gcc -c test.s
 objcopy -O binary test.o test.bin

What's the difference between test.o and test.bin?

.text
    call start
    str:
        .string "test\n"
    start:
    movl    $4, %eax
    movl    $1, %ebx
    pop     %ecx
    movl    $5, %edx
    int     $0x80
    ret

What's the above doing?

asked Apr 03 '11 12:04 by compile-fan


1 Answer

objcopy -O binary writes out only the raw contents of the loadable sections of the source file. Here, test.o is a "relocatable object file": it holds code, but also a symbol table and relocation information, which allow the file to be linked with other files into an executable program. The test.bin file produced by objcopy contains the code only, with no symbol table or relocation information. Such a "raw" file is useless for "normal" programming, but handy for code which has its own loader.

I assume that you use Linux on a 32-bit x86 system. Your test.o file has size 515 bytes. If you try objdump -x test.o you get the following, which describes the contents of the test.o object file:

$ objdump -x test.o

test.o:     file format elf32-i386
test.o
architecture: i386, flags 0x00000010:
HAS_SYMS
start address 0x00000000

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         0000001e  00000000  00000000  00000034  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  00000000  00000000  00000054  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000000  00000000  00000000  00000054  2**2
                  ALLOC
SYMBOL TABLE:
00000000 l    d  .text  00000000 .text
00000000 l    d  .data  00000000 .data
00000000 l    d  .bss   00000000 .bss
0000000b l       .text  00000000 start
00000005 l       .text  00000000 str

This gives you quite a lot of information. In particular, the file contains a section called .text beginning at offset 0x34 in the file (that's 52 in decimal) and of length 0x1e bytes (30 in decimal). You can disassemble it to see the opcodes themselves:

$ objdump -d test.o

test.o:     file format elf32-i386


Disassembly of section .text:

00000000 <str-0x5>:
   0:   e8 06 00 00 00          call   b <start>

00000005 <str>:
   5:   74 65                   je     6c <start+0x61>
   7:   73 74                   jae    7d <start+0x72>
   9:   0a 00                   or     (%eax),%al

0000000b <start>:
   b:   b8 04 00 00 00          mov    $0x4,%eax
  10:   bb 01 00 00 00          mov    $0x1,%ebx
  15:   59                      pop    %ecx
  16:   ba 05 00 00 00          mov    $0x5,%edx
  1b:   cd 80                   int    $0x80
  1d:   c3                      ret    

This is more or less the assembly you started with. The je, jae and or opcodes in the middle are spurious: this is objdump trying to interpret the literal string ("test\n", resulting in the bytes 0x74 0x65 0x73 0x74 0x0a 0x00) as opcodes. objdump -d also shows you the actual bytes found in the .text section, i.e. the bytes in the file beginning at offset 0x34. The first bytes are 0xe8 0x06 0x00...

Now, have a look at the test.bin file. It has length 30 bytes. Let's see those bytes in hexadecimal:

$ hd test.bin
00000000  e8 06 00 00 00 74 65 73  74 0a 00 b8 04 00 00 00  |.....test.......|
00000010  bb 01 00 00 00 59 ba 05  00 00 00 cd 80 c3        |.....Y........|

We recognize here exactly the 30 bytes from the .text section in test.o. That's what objcopy -O binary did: it extracted the file contents, i.e. the only non-empty section, i.e. the raw opcodes themselves, removing everything else, in particular the symbol table and relocation information.

Relocation is about what must be changed in a given piece of code so that it runs properly when stored at a given place in memory. For instance, if the code uses a variable and wishes to obtain the address of that variable, then the relocation information will contain an entry telling whoever will actually place the code in memory (normally, the linker): "here in the code, when you know where the variable will actually be, write the variable address". Interestingly, the code you show needs no relocation: the sequence of bytes can be written at an arbitrary memory location and executed as is.

Let's have a look at what the code does.

  • The call opcode jumps to the mov instruction at offset 0x0b. Also, since this is a call, it pushes on the stack the return address. The return address is where execution should continue after the call is completed, i.e. when a ret opcode is reached. This is the address of the byte following the call opcode. Here, that address is the address of the first byte of the literal string "test\n".
  • The two movl load %eax and %ebx with numerical values 4 and 1, respectively.
  • The pop opcode removes the top element from the stack, storing it in %ecx. What is this top element? That's precisely the address pushed on the stack by the call opcode, i.e. the address of the first byte of the literal string.
  • The third movl loads %edx with the numerical value 5.
  • int $0x80 is the system call on 32-bit x86 Linux: this invokes the kernel. The kernel will look at the registers to know what to do. The kernel first looks at %eax to get the "system call number"; on 32-bit x86, "4" is __NR_write, i.e. the write() system call. This call expects three parameters, in registers %ebx, %ecx and %edx, in that order. These are the destination file descriptor (here 1: that's standard output), a pointer to the data to write (here the literal string), and the length of the data to write (here 5, which corresponds to the four letters and the newline character). So this writes "test\n" on standard output.
  • The final ret returns to the caller. ret pops a value from the stack, and jumps to that address. This assumes that this code chunk was invoked with a call opcode.

So, to sum up, the code prints out test with a newline.

Let's try it with a custom loader:

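#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

int
main(void)
{
        void *p;
        int f;

        /* One anonymous page, readable and writable for now. */
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        /* Copy the 30 bytes of test.bin into the page. */
        f = open("test.bin", O_RDONLY);
        read(f, p, 30);
        close(f);
        /* Trade write access for execute access. */
        mprotect(p, 30, PROT_READ | PROT_EXEC);
        /* Call into the page as if it were a void function. */
        ((void (*)(void))p)();
        return 0;
}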

(The code above does not test returned values for errors, which is very bad, of course.)

Here, I allocate a page of memory (4096 bytes) with mmap(), asking for a page where I can read and write. p points to that chunk. Then, with open(), read() and close(), I read the contents of the test.bin file (30 bytes) into that chunk.

The mprotect() call instructs the kernel to change the access rights for my page: from now on, I want to be able to execute those bytes, i.e. consider them as machine code. I give up the right to write into the chunk (depending on the exact kernel configuration, having a page which can be both written to and executed may be forbidden).

The cryptic ((void (*)(void))p)(); reads as follows: I take p; I cast it to a pointer to a function which takes no argument and returns nothing; I invoke that function. This is C syntax for making a call into my chunk of data.

When I run the loader program, I get:

$ ./blah
test

which is what was expected: the code in test.bin writes out test on the standard output.

answered Sep 20 '22 00:09 by Thomas Pornin