
Static relocation in C

Tags: c, linker

Every machine instruction in the .text section of an object file has an address, starting at 0 and counting up.

When the linker links all the object files, the addresses of the instructions change.

I can't tell whether the linker reads the instructions in the .text section one by one in order to change every instruction's address.

Disassembly of section .text:

00000000 <_start>:

    0:  bf 00 00 00 00          mov    $0x0,%edi
    5:  8b 04 bd 00 00 00 00    mov    0x0(,%edi,4),%eax
    c:  89 c3                   mov    %eax,%ebx

After linking:

08048074 <_start>:

    8048074:    bf 00 00 00 00          mov    $0x0,%edi
    8048079:    8b 04 bd a0 90 04 08    mov    0x80490a0(,%edi,4),%eax
    8048080:    89 c3                   mov    %eax,%ebx

That is, 0 becomes 8048074, and so on.

Tianxin asked Jun 13 '16 07:06


1 Answer

Alright, I'm assuming you're using some Unix-based system, since this looks like the output of the objdump command. As far as I know, though, this applies to both ELF and PE files.

So let's start. When you use C, you compile some modules into object files and eventually link them together, e.g.:

  • m1.c -> m1.o
  • m2.c -> m2.o
  • main.c + m2.o + m1.o -> main.exe

We have some C source files, m1.c and m2.c, that define functions called by main.c. Eventually everything is compiled and linked together into main.exe, which is fully executable.

Now let's dive in and see what happens under the hood. I'd like to start with a very important point: within the final executable (main.exe in our example), all addresses are FULLY RESOLVED VIRTUAL ADDRESSES. (This is not necessarily true because of a concept called PIE / PIC, but let's not get into that for now.)

Hence, within your executable, function foo from m1.o has some resolved address (e.g. 0x400100), and wherever main.exe calls foo you'll see something like this in the disassembly:

call 0x400100

That is what conceptually happens; now let's get into what actually happens. When the processor executes a jmp or call instruction, an address is given as an operand and the processor's instruction pointer is set to that address. So your question is a smart one: should the linker go instruction by instruction, find the ones that need changing, and change them? Well, NO. The linker simply doesn't do that; it is much smarter than that.

First, when compiling, the compiler generates jumps and calls within a module (for example, a jmp to an address that belongs to m1.o in our example) relative to the instruction currently executing. What does that mean? Say we have an if statement, which compiles to jumps to some addresses. The compiler is smart enough to use a relative jump and encode the offset between instructions as the operand. When linking, the linker doesn't even have to change those: it doesn't matter at which address the code is loaded, because the jumps are relative to the current instruction and the offsets between instructions of one object file stay fixed through the linking stage.
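To make the relative-jump point concrete, here is a minimal sketch in C. It assumes the x86 rel32 encoding used by near call and jmp, where the operand is the distance from the end of the instruction to the target. The function names and sample addresses are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* rel32 operand of a near call/jmp: distance from the address of the
 * NEXT instruction (insn_addr + insn_len) to the target.
 * The final conversion relies on the usual two's-complement wraparound. */
int32_t rel32(uint32_t insn_addr, uint32_t insn_len, uint32_t target)
{
    return (int32_t)(target - (insn_addr + insn_len));
}

/* Sanity check: moving the whole module by `base` shifts both the
 * instruction and its target, so the encoded operand does not change. */
void check_position_independent(uint32_t insn_addr, uint32_t insn_len,
                                uint32_t target, uint32_t base)
{
    assert(rel32(insn_addr, insn_len, target) ==
           rel32(insn_addr + base, insn_len, target + base));
}
```

For instance, rel32(0x10, 5, 0x40) gives 0x2b whether the module sits at address 0 or at 0x8048074, which is why these operands need no patching.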

Now here's where things get a little more complex. We've covered how the linker avoids changing addresses within m1.o, but what if m2.o calls functions defined in m1.o? Both are object files, and there is no way on earth the compiler can know the offset between them, since neither knows how many other modules it will be linked with. How is this solved? Symbol and relocation tables are introduced.

  • Symbol Table - A table containing all symbols within your module. A symbol is something that other modules may need to recognize by name, such as a function or a global variable.
  • Relocation Table - A table containing all "occurrences" of those symbols within a module, i.e. every place where a symbol's address still has to be filled in.
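As a rough picture of those two tables, here is a simplified sketch in C. The struct layouts and the offset 0x0d are hypothetical; real ELF objects use richer records (Elf32_Sym and Elf32_Rel) and would also list the library call made by foo, which is left out here for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified symbol table entry. */
struct toy_symbol {
    const char *name;
    int defined;      /* 1 = defined in this module, 0 = defined elsewhere */
    uint32_t offset;  /* offset within this module's .text, if defined */
};

/* Hypothetical, simplified relocation table entry. */
struct toy_reloc {
    uint32_t offset;  /* where in .text the address placeholder sits */
    size_t symbol;    /* index into the symbol table */
};

/* Roughly what the tables of an object file like m1.o could hold. */
const struct toy_symbol m1_symbols[] = {
    {"foo", 1, 0x00},
    {"goo", 0, 0x00},
};
const struct toy_reloc m1_relocs[] = {
    {0x0d, 1},        /* made-up offset of the goo placeholder inside foo */
};

/* Count the symbols the linker still has to find in other modules. */
size_t count_undefined(const struct toy_symbol *syms, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (!syms[i].defined)
            count++;
    return count;
}

/* Name of the symbol a relocation entry refers to. */
const char *reloc_symbol_name(const struct toy_reloc *r,
                              const struct toy_symbol *syms, size_t nsyms)
{
    assert(r->symbol < nsyms);
    return syms[r->symbol].name;
}
```

Here count_undefined(m1_symbols, 2) returns 1 (only goo), and the single relocation entry names goo as the symbol whose address is still missing.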

You may have heard of these before, but let me explain them. Before getting into it, I should warn you that I'm more familiar with ELF files, but as far as I know PE files work the same way conceptually.

Let's look at this example code:

/** file: m1.c **/
#include <stdio.h>

extern void goo(void);  /* defined in m2.c */

void foo(void)
{
  printf("I am foo()!\n");
  goo();
}

and

/** file: m2.c **/
#include <stdio.h>

void goo(void)
{
  printf("I am goo()!\n");
}

When m1.c is compiled, the object file m1.o will contain tables saying something like this:

SYMBOLS:
  foo -> at offset X within file
  goo -> UNDEFINED
RELOCATION:
  goo -> at offset Y within file

What this means is that the compiler generates a table listing all the functions the module uses and whether each one is defined. If a function is defined, the table gives the offset at which it is defined within the file; if it is not defined, the table says so.

It also records that within this module goo is called at offset Y and needs to be relocated. (We're getting to my point; this is the answer to your question!)

When linking into an executable, the linker takes the symbols of all object files and assigns each defined symbol a final address. It then goes through the symbol table of each object file to determine which symbols are still undefined there, goes through the relocation table to see which references are made to those symbols, goes to that place within the file, and simply rewrites the placeholder address with the resolved address. So if before we had something like this in m1.o:

call 0x000000 ;undefined goo address

after symbol resolution, the linker finds the relocation table entry saying that goo's address needs to be filled in at offset Y, and we end up with

call 0x400100 ;actual goo address
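The same patching step can be sketched on the bytes from your own disassembly. This is only an illustration: apply_abs32 is a made-up name, and it assumes the relocation behaves like the i386 ELF type R_386_32, where the symbol's address is added to the value already stored in place. In your object file, mov 0x0(,%edi,4),%eax starts at offset 5, and its 4-byte address field starts at offset 8, after the opcode, ModRM and SIB bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value from text[offset]. */
static uint32_t read_le32(const uint8_t *text, size_t offset)
{
    return (uint32_t)text[offset]
         | (uint32_t)text[offset + 1] << 8
         | (uint32_t)text[offset + 2] << 16
         | (uint32_t)text[offset + 3] << 24;
}

/* Apply one absolute 32-bit relocation: add the symbol's final address to
 * the placeholder stored at `offset` and write the sum back, little-endian.
 * No instruction is decoded; only the 4 bytes named by the entry change. */
void apply_abs32(uint8_t *text, size_t text_len, size_t offset,
                 uint32_t symbol_addr)
{
    assert(text_len >= 4 && offset <= text_len - 4);
    uint32_t value = read_le32(text, offset) + symbol_addr;
    for (int i = 0; i < 4; i++)
        text[offset + i] = (uint8_t)(value >> (8 * i));
}
```

Calling apply_abs32 with offset 8 and address 0x80490a0 on the 14 bytes of your unlinked _start turns 8b 04 bd 00 00 00 00 into 8b 04 bd a0 90 04 08, which matches the linked listing, while every other byte stays as it was.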

FYI, when you get an undefined reference linker error, it means there is an undefined symbol in some symbol table and the linker can't find a matching definition for it. Also, in case I haven't made myself clear: this works exactly the same way for global and static variables, which are also symbols.
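A toy version of that lookup, as a hedged sketch: the struct, the function name and the addresses are all hypothetical, and a real linker uses hash tables and many more rules, but the failure case is the same idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry in the linker's merged table of defined symbols. */
struct global_symbol {
    const char *name;
    uint32_t address;  /* final address assigned by the linker */
};

/* Look up `name` among all defined symbols.
 * Returns 1 and stores the address on success. Returns 0 if no module
 * defines the symbol, which is when a linker would report an
 * "undefined reference". */
int resolve_symbol(const struct global_symbol *table, size_t n,
                   const char *name, uint32_t *address_out)
{
    assert(name != NULL && address_out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].name, name) == 0) {
            *address_out = table[i].address;
            return 1;
        }
    }
    return 0;
}
```

With a table holding foo and goo, looking up goo succeeds and yields its address, while looking up a name that no object file defines returns 0 and leaves the output untouched.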

DrPrItay answered Oct 07 '22 12:10