Why is the second loop over a static array in the BSS faster than the first?

Question

I have the following code that writes a global array with zeros twice, once forward and once backward.

#include <string.h>
#include <time.h>
#include <stdio.h>
#define SIZE 100000000

char c[SIZE];
char c2[SIZE];

int main()
{
   int i;
   clock_t t = clock();
   for(i = 0; i < SIZE; i++)
       c[i] = 0;

   t = clock() - t;
   printf("%d\n\n", t);

   t = clock(); 
   for(i = SIZE - 1; i >= 0; i--)
      c[i] = 0;

   t = clock() - t;
   printf("%d\n\n", t);
}

I've run it a couple of times and the second print always shows a smaller value...

However, if I change c to c2 in one of the loops, the time difference between the two prints becomes negligible... what is the reason for that difference?

EDIT:

I've tried compiling with -O3 and looked into the assembly: there were 2 calls to memset but the second was still printing a smaller value.

osgx · Accepted Answer

When you define global data in C without an initializer, it is zero-initialized:

char c[SIZE];
char c2[SIZE];

In the Linux (Unix) world this means that both c and c2 will be allocated in a special ELF file section, the .bss:

... data segment containing statically-allocated variables represented solely by zero-valued bits initially

The .bss segment exists so that all those zeroes do not have to be stored in the binary; it just says something like "this program wants to have 200MB of zeroed memory".

When your program is loaded, the ELF loader (the kernel in the case of classic static binaries, or the ld.so dynamic loader, also known as interp) will allocate the memory for .bss, usually with something like mmap with the MAP_ANONYMOUS flag and a READ+WRITE permission/protection request.

But the memory manager in the OS kernel will not hand you all 200 MB of zeroed memory up front. Instead it marks that part of your process's virtual memory as zero-initialized, and every page of it points to the special zero page in physical memory. That page holds 4096 zero bytes, so if you read from c or c2 you get zero bytes; this mechanism lets the kernel cut down memory requirements.

The mappings to the zero page are special: they are marked read-only in the page table. The first write to any such virtual page makes the hardware (the MMU) raise a page-fault exception. The kernel handles this fault, in your case in the minor page-fault handler: it allocates one physical page, fills it with zero bytes, and remaps the just-accessed virtual page to that physical page. Then it reruns the faulting instruction.

I modified your code a bit (each loop is moved into a separate function):

$ cat b.c
#include <string.h>
#include <time.h>
#include <stdio.h>
#define SIZE 100000000

char c[SIZE];
char c2[SIZE];

void FIRST()
{
   int i;
   for(i = 0; i < SIZE; i++)
       c[i] = 0;
}

void SECOND()
{
   int i;
   for(i = 0; i < SIZE; i++)
       c[i] = 0;
}


int main()
{
   int i;
   clock_t t = clock();
   FIRST();
   t = clock() - t;
   printf("%d\n\n", t);

   t = clock(); 
   SECOND();

   t = clock() - t;
   printf("%d\n\n", t);
}

Compile with gcc b.c -fno-inline -O2 -o b, then run under Linux's perf stat or the more generic /usr/bin/time to get the page-fault count:

$ perf stat ./b
139599

93283


 Performance counter stats for './b':
 ....
            24 550 page-faults               #    0,100 M/sec           


$ /usr/bin/time ./b
234246

92754

Command exited with non-zero status 7
0.18user 0.15system 0:00.34elapsed 99%CPU (0avgtext+0avgdata 98136maxresident)k
0inputs+8outputs (0major+24576minor)pagefaults 0swaps

So we have about 24.5 thousand minor page faults. With the standard 4096-byte page size on x86/x86_64, that covers close to 100 megabytes, which is the size of c.

With the Linux profiler's perf record/perf report we can find where the page faults are generated:

$ perf record -e page-faults ./b
...skip some spam from non-root run of perf...
213322

97841

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (~801 samples) ]

$ perf report -n |cat
...
# Samples: 467  of event 'page-faults'
# Event count (approx.): 24583
#
# Overhead       Samples  Command      Shared Object                   Symbol
# ........  ............  .......  .................  .......................
#
    98.73%           459        b  b                  [.] FIRST              
     0.81%             1        b  libc-2.19.so       [.] __new_exitfn       
     0.35%             1        b  ld-2.19.so         [.] _dl_map_object_deps
     0.07%             1        b  ld-2.19.so         [.] brk                
     ....

So now we can see that only the FIRST function generates page faults (on the first write to each bss page), and SECOND generates none. Every page fault corresponds to some work done by the OS kernel, and this work is done only once per bss page (because bss is not unmapped and remapped).
