 

What does the compiler do with a[i] when a is an array? And what if a is a pointer?

The c-faq told me that the compiler does different things with a[i] depending on whether a is an array or a pointer. Here's an example from the c-faq:

char a[] = "hello";
char *p = "world";

Given the declarations above, when the compiler sees the expression a[3], it emits code to start at the location ``a'', move three past it, and fetch the character there. When it sees the expression p[3], it emits code to start at the location ``p'', fetch the pointer value there, add three to the pointer, and finally fetch the character pointed to.

But I was also told that when dealing with a[i], the compiler converts a (which is an array) to a pointer. So I want to look at the assembly code to find out which is right.

EDIT:

Here's the source of this statement: the c-faq. Note this sentence:

an expression of the form a[i] causes the array to decay into a pointer, following the rule above, and then to be subscripted just as would be a pointer variable in the expression p[i] (although the eventual memory accesses will be different, "

I'm pretty confused by this: since a has decayed to a pointer, why does he say the "memory accesses will be different"?

Here's my code:

// array.cpp
#include <cstdio>
using namespace std;

int main()
{
    char a[6] = "hello";
    const char *p = "world";  // string literals are const in C++
    printf("%c\n", a[3]);
    printf("%c\n", p[3]);
}

And here's part of the assembly code I got using g++ -S array.cpp:

    .file   "array.cpp" 
    .section    .rodata
.LC0:
    .string "world"
.LC1:
    .string "%c\n"
    .text
.globl main
    .type   main, @function
main:
.LFB2:
    leal    4(%esp), %ecx
.LCFI0:
    andl    $-16, %esp
    pushl   -4(%ecx)
.LCFI1:
    pushl   %ebp
.LCFI2:
    movl    %esp, %ebp
.LCFI3:
    pushl   %ecx
.LCFI4:
    subl    $36, %esp
.LCFI5:
    movl    $1819043176, -14(%ebp)
    movw    $111, -10(%ebp)
    movl    $.LC0, -8(%ebp)
    movzbl  -11(%ebp), %eax
    movsbl  %al,%eax
    movl    %eax, 4(%esp)
    movl    $.LC1, (%esp)
    call    printf
    movl    -8(%ebp), %eax
    addl    $3, %eax
    movzbl  (%eax), %eax
    movsbl  %al,%eax
    movl    %eax, 4(%esp)
    movl    $.LC1, (%esp)
    call    printf
    movl    $0, %eax
    addl    $36, %esp
    popl    %ecx
    popl    %ebp
    leal    -4(%ecx), %esp
    ret 

I cannot figure out the mechanism of a[3] and p[3] from the code above. For example:

  • Where was "hello" initialized?
  • What does $1819043176 mean? Is it perhaps the memory address of "hello" (the address of a)?
  • I'm sure that "-11(%ebp)" means a[3], but why?
  • In "movl -8(%ebp), %eax", the content of pointer p is stored in EAX, right? So does $.LC0 mean the content of pointer p?
  • What does "movsbl %al,%eax" mean?
  • And note these 3 lines of code:
    movl $1819043176, -14(%ebp)
    movw $111, -10(%ebp)
    movl $.LC0, -8(%ebp)

    The last one uses "movl", so why didn't it overwrite the content of -10(%ebp)? (I know the answer now :) Addresses grow upward from the given offset, so "movl $.LC0, -8(%ebp)" only overwrites {-8, -7, -6, -5}(%ebp).)

I'm sorry, but I'm totally confused by the mechanism, as well as by the assembly code...

Thank you very much for your help.

ibread asked Jan 15 '10 16:01


2 Answers

a is an array of chars (in most expressions it decays to a pointer to its first element). p is a pointer to a char which, in this case, happens to point at a string literal.

movl    $1819043176, -14(%ebp)
movw    $111, -10(%ebp)

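Initializes the local array "hello" on the stack (that's why it is addressed relative to ebp). The two constants are not addresses: they are the characters themselves, packed into integers. Since "hello" plus its terminating '\0' takes 6 bytes, more than one 4-byte store can hold, it takes two instructions: a movl for the first four bytes and a movw for the remaining two.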

movzbl  -11(%ebp), %eax
movsbl  %al,%eax

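References a[3]: a starts at -14(%ebp), so a[3] is at -14 + 3 = -11(%ebp), and the compiler computes that offset at compile time without loading any pointer. movzbl loads that one byte into eax; the movsbl then sign-extends it, because a char is promoted to int when passed to printf. The two-step sequence is just what unoptimized code looks like, not a limitation of ebp-relative addressing.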

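movl -8(%ebp), %eax does indeed load the value stored in the pointer p; addl $3 then adds the index, and movzbl (%eax) fetches the character at the resulting address.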

.LC0 is a label for the string "world" in the .rodata section: its actual address is fixed when the program is linked and loaded. So $.LC0 is the address of that string, which is the value stored into p.

movsbl %al,%eax means "move with sign extension, byte to long": it copies the 8-bit value in al into the 32-bit eax, filling the upper 24 bits with copies of the sign bit. al is the lowest byte of the register eax. (movzbl is the zero-extending counterpart.)

jldupont answered Sep 21 '22 00:09


Getting on the language side of this, since the assembler side has already been handled:

Note this sentence: "an expression of the form a[i] causes the array to decay into a pointer, following the rule above, and then to be subscripted just as would be a pointer variable in the expression p[i] (although the eventual memory accesses will be different, " I'm pretty confused by this: since a has decayed to a pointer, why does he say the "memory accesses will be different"?

This is because, after the decay, access works the same way for the array (now a pointer value) and for the pointer. The difference is in how that pointer value is obtained in the first place. Let's look at an example:

char c[1];

char cc;
char *pc = &cc;

Now, you have an array. This array does not take any storage other than one char! There is no pointer stored for it. And you have a pointer that points to a char. The pointer takes the size of one address, and you have one char that the pointer points to. Now let's look at what happens in the array case to get the pointer value:

c[0] = 'A';
// #1: equivalent: *(c + 0) = 'A';
// #2: => 'c' is not the operand of address-of or sizeof, so it decays
// #3: => get address of "c": This is the pointer value P1

The pointer case is different:

pc[0] = 'A';
// #1: equivalent: *(pc + 0) = 'A';
// #2: => pointer value is stored in 'pc'
// #3: => thus: read address stored in 'pc': This is the pointer value P1

As you can see, in the array case we don't need to read from memory to get the pointer value to which the index (here a boring 0) is added, because the address of the array already is the pointer value needed. But in the pointer case, the pointer value we need is stored in the pointer: we need one read from memory to get that address.

After this, the path is equal for both:

// #4: add "0 * sizeof(char)" to P1. This is the address P2
// #5: store 'A' to address P2

Here is the assembler code generated for the array and the pointer case:

        add     $2, $0, 65  ; write 65 into r2
        stb     $2, $0, c   ; store r2 into address of c
# pointer case follows
        ldw     $3, $0, pc  ; load value stored in pc into r3
        add     $2, $0, 65  ; write 65 into r2
        stb     $2, $3, 0   ; store r2 into address loaded to r3

We can just store 65 (ASCII for 'A') at the address of c (which will be known already at compile or link time when it is global). For the pointer case, we will first have to load the address stored by it into register 3, and then write the 65 to that address.

Johannes Schaub - litb answered Sep 20 '22 00:09