I have the following code that writes a global array with zeros twice, once forward and once backward.
#include <string.h>
#include <time.h>
#include <stdio.h>
#define SIZE 100000000
char c[SIZE];
char c2[SIZE];
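int main()
{
    int i;
    clock_t t = clock();
    /* first pass: write the array front to back */
    for(i = 0; i < SIZE; i++)
        c[i] = 0;
    t = clock() - t;
    /* clock_t is not an int; cast it to match the format specifier */
    printf("%ld\n\n", (long)t);
    t = clock();
    /* second pass: write the same array back to front */
    for(i = SIZE - 1; i >= 0; i--)
        c[i] = 0;
    t = clock() - t;
    printf("%ld\n\n", (long)t);
}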
I've run it a couple of times, and the second print always shows a smaller value.
However, if I change c to c2 in one of the loops, the time difference between the two prints becomes negligible. What is the reason for that difference?
EDIT:
I've tried compiling with -O3 and looked into the assembly: there were 2 calls to memset but the second was still printing a smaller value.
When you define global data in C without an initializer, it is zero-initialized:
char c[SIZE];
char c2[SIZE];
In the Linux (Unix) world this means that both c and c2 will be allocated in a special ELF file section, the .bss:
... data segment containing statically-allocated variables represented solely by zero-valued bits initially
The .bss segment exists so that the binary does not have to store all those zeroes; it just says something like "this program wants to have 200 MB of zeroed memory".
When your program is loaded, the ELF loader (the kernel in the case of classic static binaries, or the ld.so dynamic loader, also known as interp) will allocate the memory for .bss, usually with something like mmap with the MAP_ANONYMOUS flag and a request for READ+WRITE permissions/protection.
But the memory manager in the OS kernel will not give you all 200 MB of zeroed memory up front. Instead it will mark that part of your process's virtual memory as zero-initialized, and every page of this memory will point to the special zero page in physical memory. This page holds 4096 zero bytes, so if you read from c or c2, you will get zero bytes; this mechanism allows the kernel to cut down on memory requirements.
The mappings to the zero page are special: they are marked read-only in the page table. When you first write to any of these virtual pages, the hardware (the MMU) raises a page-fault exception. The kernel handles this fault, in your case in the minor page-fault handler. It allocates one physical page, fills it with zero bytes, and remaps the just-accessed virtual page to this physical page. Then it reruns the faulting instruction.
I modified your code a bit (each loop is moved into a separate function):
$ cat b.c
#include <string.h>
#include <time.h>
#include <stdio.h>
#define SIZE 100000000
char c[SIZE];
char c2[SIZE];
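/* first pass over c: every page is written for the first time */
void FIRST()
{
    int i;
    for(i = 0; i < SIZE; i++)
        c[i] = 0;
}
/* second pass over the same array: the pages are already mapped */
void SECOND()
{
    int i;
    for(i = 0; i < SIZE; i++)
        c[i] = 0;
}
int main()
{
    int i;
    clock_t t = clock();
    FIRST();
    t = clock() - t;
    /* clock_t is not an int; cast it to match the format specifier */
    printf("%ld\n\n", (long)t);
    t = clock();
    SECOND();
    t = clock() - t;
    printf("%ld\n\n", (long)t);
}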
Compile with gcc b.c -fno-inline -O2 -o b, then run it under Linux's perf stat or the more generic /usr/bin/time to get the page-fault count:
$ perf stat ./b
139599
93283
Performance counter stats for './b':
....
24 550 page-faults # 0,100 M/sec
$ /usr/bin/time ./b
234246
92754
Command exited with non-zero status 7
0.18user 0.15system 0:00.34elapsed 99%CPU (0avgtext+0avgdata 98136maxresident)k
0inputs+8outputs (0major+24576minor)pagefaults 0swaps
So we have about 24.5 thousand minor page faults. With the standard x86/x86_64 page size of 4096 bytes, that is close to 100 megabytes, which matches the size of c, the only array the program writes.
With the perf record / perf report Linux profiler we can find where the page faults are generated:
$ perf record -e page-faults ./b
...skip some spam from non-root run of perf...
213322
97841
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (~801 samples) ]
$ perf report -n |cat
...
# Samples: 467 of event 'page-faults'
# Event count (approx.): 24583
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. .......................
#
98.73% 459 b b [.] FIRST
0.81% 1 b libc-2.19.so [.] __new_exitfn
0.35% 1 b ld-2.19.so [.] _dl_map_object_deps
0.07% 1 b ld-2.19.so [.] brk
....
So now we can see that only the FIRST function generates page faults (on the first write to each bss page), and SECOND does not generate any. Every page fault corresponds to some work done by the OS kernel, and this work is done only once per bss page (because bss is not unmapped and mapped again).