
How are functions encoded/stored in memory?

I understand how things like numbers and letters are encoded in binary, and thus can be stored as 0's and 1's.

But how are functions stored in memory? I don't see how they could be stored as 0's and 1's, and I don't see how something could be stored in memory as anything besides 0's and 1's.

Adam Zerner asked Aug 15 '14 23:08



1 Answer

They are, in fact, stored in memory as 0's and 1's.

Here is a real-world example:

int func(int a, int b) {
    return (a + b);
}

Here is an example of 32-bit x86 machine instructions that a compiler might generate for the function (in a text representation known as assembly code):

func:
        push    ebp
        mov     ebp, esp
        mov     edx, [ebp+8]
        mov     eax, [ebp+12]
        add     eax, edx
        pop     ebp
        ret

Going into how each of these instructions works is beyond the scope of this question, but each of these mnemonics (such as add, pop, and mov), together with its operands, is encoded into 1's and 0's. This table shows many of the Intel instructions and a summary of how they are encoded. See also the x86 tag wiki for links to docs/guides/manuals.


So how does one go about converting code from text assembly into machine-readable bytes (aka machine code)? Take, for example, the instruction add eax, edx. This page shows how the add instruction is encoded. eax and edx are registers: spots in the processor used to hold information for processing. Variables in a program will often map to registers at some point. Because we are adding two registers and the registers are 32-bit, we select the opcode 00000001 (see also Intel's official instruction-set reference manual entry for ADD, which lists all the forms available).

The next step is specifying the operands. This section of the same page shows how this is done with the example "add ecx, eax", which is very similar to our own. The first two bits have to be '11' to show that both operands are registers. The next 3 bits specify the source register; in our case that is edx rather than the eax in their example, which gives '010'. The final 3 bits specify the destination register, our eax, which is '000'. Putting the opcode byte and the operand byte together gives

00000001 11010000

In hexadecimal, that is 01 D0. A similar process can be applied to convert any instruction into binary. The tool that does this automatically is called an assembler.
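As a rough illustration of that hand-encoding step, here is a minimal C sketch that builds the two bytes for a register-to-register add. The names encode_add_reg_reg and reg32 are made up for this example; the register numbers are the standard x86 ones, and only the opcode 01 form (ADD r/m32, r32) is handled.

```c
#include <assert.h>
#include <stdint.h>

/* Standard x86 numbering of the 32-bit general-purpose registers. */
enum reg32 { EAX = 0, ECX = 1, EDX = 2, EBX = 3, ESP = 4, EBP = 5, ESI = 6, EDI = 7 };

/* Hypothetical helper: encode "add dst, src" for two 32-bit registers
   using opcode 01 (ADD r/m32, r32). Writes exactly 2 bytes to out. */
void encode_add_reg_reg(enum reg32 dst, enum reg32 src, uint8_t out[2])
{
    assert(dst <= EDI && src <= EDI);

    out[0] = 0x01;                                /* opcode byte            */
    out[1] = (uint8_t)(0xC0 | (src << 3) | dst);  /* 11 | src bits | dst bits */
}
```

Calling encode_add_reg_reg(EAX, EDX, bytes) fills bytes with 01 D0, matching the hand-worked result, and encode_add_reg_reg(ECX, EAX, bytes) gives 01 C1 for the "add ecx, eax" example. A real assembler covers many more operand forms than this.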


So, running the above assembly code through an assembler produces the following output:

66 55 66 89 E5 66 67 8B 55 08 66 67 8B 45 0C 66 01 D0 66 5D C3

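Note the 01 D0 near the end of the string; this is our add instruction. The 66 and 67 bytes are operand-size and address-size override prefixes, most likely present because the assembler was targeting 16-bit mode; they do not appear when the same code is assembled for 32-bit mode. Converting machine-code bytes back into text assembly-language mnemonics is called disassembling: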

 address | machine code  |  disassembly
   0:      55              push   ebp
   1:      89 e5           mov    ebp, esp
   3:      8b 55 08        mov    edx, [ebp+0x8]
   6:      8b 45 0c        mov    eax, [ebp+0xc]
   9:      01 d0           add    eax, edx
   b:      5d              pop    ebp
   c:      c3              ret    

Addresses start at zero because this is only an object file (.o), not a linked binary, so they are just offsets from the start of the file's .text section.
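To show where the address column in that listing comes from, here is a small C sketch that walks the 13 bytes and records the offset at which each instruction starts. It is not a real disassembler: insn_length and insn_offsets are hypothetical helpers that only understand the handful of opcodes used in this function, and they give up (return 0) on anything else.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the instruction at code[0], or 0 if it is outside
   the tiny subset used in this example. */
size_t insn_length(const unsigned char *code, size_t remaining)
{
    assert(code != NULL);
    if (remaining == 0)
        return 0;

    unsigned char op = code[0];

    /* 50..5F are push/pop of a 32-bit register, C3 is ret: one byte each. */
    if ((op >= 0x50 && op <= 0x5F) || op == 0xC3)
        return 1;

    /* 01 (add), 89 and 8B (mov) are followed by an operand byte. */
    if (op == 0x01 || op == 0x89 || op == 0x8B) {
        if (remaining < 2)
            return 0;
        unsigned mod = code[1] >> 6;   /* top two bits    */
        unsigned rm  = code[1] & 7;    /* bottom three bits */

        if (mod == 3)                  /* register to register */
            return 2;
        if (mod == 1 && rm != 4)       /* [reg + 8-bit offset], no extra byte */
            return remaining >= 3 ? 3 : 0;
    }
    return 0;                          /* not handled in this sketch */
}

/* Stores the start offset of each instruction in offsets[] and returns
   how many were found, or 0 if an unknown instruction is hit. */
size_t insn_offsets(const unsigned char *code, size_t len,
                    size_t *offsets, size_t max)
{
    size_t pos = 0, count = 0;
    while (pos < len) {
        size_t n = insn_length(code + pos, len - pos);
        if (n == 0 || count == max)
            return 0;
        offsets[count++] = pos;
        pos += n;
    }
    return count;
}
```

Fed the bytes 55 89 E5 8B 55 08 8B 45 0C 01 D0 5D C3, this finds seven instructions starting at offsets 0, 1, 3, 6, 9, b and c, the same as the address column in the listing.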

You can see this for any function you like on the Godbolt Compiler Explorer (or on your own machine on any binary, freshly-compiled or not, using a disassembler).


You may notice there is no mention of the name "func" in the final output. This is because in machine code, a function is referenced by its location in RAM, not its name. The compiler-output object file may have a func entry in its symbol table referring to this block of machine code, but the symbol table is read by software, not something the CPU hardware can decode and run directly. The bit-patterns of the machine code are seen and decoded directly by transistors in the CPU.
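The "location, not name" point is visible from C itself through function pointers. The following is only a sketch: binary_fn and call_by_address are names invented for this example, and func is the same function as at the top of the answer.

```c
#include <assert.h>

int func(int a, int b)
{
    return a + b;
}

/* A function pointer is a value that holds the address of a function's
   machine code. binary_fn is a made-up name for this example. */
typedef int (*binary_fn)(int, int);

/* Calls whatever function lives at the given address. At this point the
   name "func" is not involved at all, only the address that was passed in. */
int call_by_address(binary_fn fn, int a, int b)
{
    assert(fn != 0);
    return fn(a, b);
}
```

call_by_address(func, 2, 3) returns 5. Inside call_by_address the program only has an address to jump to; the name func matters to the compiler and linker, not to the CPU.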

Sometimes it is hard for us to understand how computers encode instructions like this at a low level because as programmers or power users, we have tools to avoid ever dealing with them directly. We rely on compilers, assemblers, and interpreters to do the work for us. Nonetheless, anything a computer ever does must eventually be specified in machine code.

Dougvj answered Nov 14 '22 19:11
