So recently I was thinking about strcpy and went back to K&R, where they show the implementation as
while (*dst++ = *src++) ;
However I mistakenly transcribed it as:
while (*dst = *src)
{
src++; //technically could be ++src on these lines
dst++;
}
In any case, that got me thinking about whether the compiler would actually produce different code for these two. My initial thought was that they should be nearly identical: since src and dst are incremented but never used afterwards, I assumed the compiler would know not to actually preserve them as "variables" in the generated machine code.
Using Windows 7 with VS 2010 C++ SP1, building in 32-bit Release mode (/O2), I got the disassembly for both of the above incarnations. To prevent the function from referencing the input directly and being inlined, I made a DLL with each of the functions. I have omitted the prologue and epilogue of the generated assembly.
while (*dst++ = *src++)
6EBB1003 8B 55 08 mov edx,dword ptr [src]
6EBB1006 8B 45 0C mov eax,dword ptr [dst]
6EBB1009 2B D0 sub edx,eax //prepare edx so that edx + eax always points to src
6EBB100B EB 03 jmp docopy+10h (6EBB1010h)
6EBB100D 8D 49 00 lea ecx,[ecx] //looks like align padding, never hit this line
6EBB1010 8A 0C 02 mov cl,byte ptr [edx+eax] //ptr [edx+ eax] points to char in src :loop begin
6EBB1013 88 08 mov byte ptr [eax],cl //copy char to dst
6EBB1015 40 inc eax //inc dst ptr; src (edx+eax) advances with it
6EBB1016 84 C9 test cl,cl // check for 0 (null terminator)
6EBB1018 75 F6 jne docopy+10h (6EBB1010h) //if not goto :loop begin
;
I have annotated the code above: it is essentially a single loop, with only one check for null and one memory copy per iteration.
Now let's look at my mistaken version:
while (*dst = *src)
6EBB1003 8B 55 08 mov edx,dword ptr [src]
6EBB1006 8A 0A mov cl,byte ptr [edx]
6EBB1008 8B 45 0C mov eax,dword ptr [dst]
6EBB100B 88 08 mov byte ptr [eax],cl //copy 0th char to dst
6EBB100D 84 C9 test cl,cl //check for 0
6EBB100F 74 0D je docopy+1Eh (6EBB101Eh) // return if we encounter null terminator
6EBB1011 2B D0 sub edx,eax
6EBB1013 8A 4C 02 01 mov cl,byte ptr [edx+eax+1] //get +1th char :loop begin
{
src++;
dst++;
6EBB1017 40 inc eax
6EBB1018 88 08 mov byte ptr [eax],cl //copy above char to dst
6EBB101A 84 C9 test cl,cl //check for 0
6EBB101C 75 F5 jne docopy+13h (6EBB1013h) // if not goto :loop begin
}
In my version, I see that it first copies the 0th char to the destination and checks it for null, and only then enters the loop, where it checks for null again on each iteration. So the loop itself remains largely the same, but the 0th character is now handled before the loop. This is presumably going to be sub-optimal compared with the first case, if only because of the extra code.
I am wondering if anyone knows why the compiler is being prevented from generating the same (or nearly the same) code as in the first example. Is this specific to the Microsoft compiler, or possibly caused by my compiler/linker settings?
Here is the full code, in 2 files (one function replaces the other).
// in first dll project
__declspec(dllexport) void docopy(const char* src, char* dst)
{
while (*dst++ = *src++);
}
__declspec(dllexport) void docopy(const char* src, char* dst)
{
while (*dst = *src)
{
++src;
++dst;
}
}
//separate main.cpp file calls docopy
void docopy(const char* src, char* dst);
const char* source = "source";
char destination[100];
int main()
{
docopy(source, destination);
}
The answer, of course, is that the compiler was fed different code as input, so it is perfectly valid for it to generate different output.
In the first example, the post-increment always happens, even if src starts out pointing to a null character. In the same starting situation, the second example would not increment the pointers at all.
Of course the compiler has other options. The "copy the first byte, then enter the loop if it is not 0" approach is what gcc-4.5.1 produces with -O1. With -O2 and -O3, it produces:
.LFB0:
.cfi_startproc
jmp .L6 // jump to copy
.p2align 4,,10
.p2align 3
.L4:
addq $1, %rdi // increment pointers
addq $1, %rsi
.L6: // copy
movzbl (%rdi), %eax // get source byte
testb %al, %al // check for 0
movb %al, (%rsi) // move to dest
jne .L4 // loop if nonzero
rep
ret
.cfi_endproc
which is quite similar to what it produces for the K&R loop. Whether that's actually better I can't say, but it looks nicer.
Apart from the jump into the loop, the instructions for the K&R loop are exactly the same, just ordered differently:
.LFB0:
.cfi_startproc
.p2align 4,,10
.p2align 3
.L2:
movzbl (%rdi), %eax // get source byte
addq $1, %rdi // increment source pointer
movb %al, (%rsi) // move byte to dest
addq $1, %rsi // increment dest pointer
testb %al, %al // check for 0
jne .L2 // loop if nonzero
rep
ret
.cfi_endproc