On my machine, Time A and Time B swap depending on whether A is defined or not (which changes the order in which the two callocs are called). I initially attributed this to the paging system. Weirdly, when mmap is used instead of calloc, the situation is even more bizarre: both loops take the same amount of time, as expected. As can be seen with strace, the callocs ultimately result in two mmaps, so there is no return-already-allocated-memory magic going on.
I'm running Debian testing on an Intel i7.
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE 500002816

#ifndef USE_MMAP
#define ALLOC calloc
#else
#define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE, \
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#endif

int main() {
    clock_t start, finish;
#ifdef A
    int *arr1 = ALLOC(sizeof(int), SIZE);
    int *arr2 = ALLOC(sizeof(int), SIZE);
#else
    int *arr2 = ALLOC(sizeof(int), SIZE);
    int *arr1 = ALLOC(sizeof(int), SIZE);
#endif
    int i;

    start = clock();
    for (i = 0; i < SIZE; i++)
        arr1[i] = (i + 13) * 5;
    finish = clock();
    printf("Time A: %.2f\n", ((double)(finish - start)) / CLOCKS_PER_SEC);

    start = clock();
    for (i = 0; i < SIZE; i++)
        arr2[i] = (i + 13) * 5;
    finish = clock();
    printf("Time B: %.2f\n", ((double)(finish - start)) / CLOCKS_PER_SEC);

    return 0;
}
The output I get:
~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.94
Time B: 0.34
~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.34
Time B: 0.90
~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.89
Time B: 0.90
~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.91
Time B: 0.92
You should also test using malloc instead of calloc. One thing that calloc does is to fill the allocated memory with zeros.
I believe in your case that when you calloc arr1 last and then assign to it, it is already faulted into cache memory, since it was the last one allocated and zero-filled. When you calloc arr1 first and arr2 second, then the zero-fill of arr2 pushes arr1 out of cache.
Guess I could have written more, or less, especially as less is more.
The reason can differ from system to system. However, for the C library:
The total time used for each operation is the other way around if you time the calloc plus the iteration.
I.e.:
Calloc arr1 : 0.494992654
Calloc arr2 : 0.000021250
Itr arr1 : 0.430646035
Itr arr2 : 0.790992411
Sum arr1 : 0.925638689
Sum arr2 : 0.791013661
Calloc arr1 : 0.503130736
Calloc arr2 : 0.000025906
Itr arr1 : 0.427719162
Itr arr2 : 0.809686047
Sum arr1 : 0.930849898
Sum arr2 : 0.809711953
The first calloc (and subsequently malloc) has a longer execution time than the second. A call such as malloc(0) before any calloc etc. evens out the time used for malloc-like calls in the same process (explanation below). One can, however, see a slight decline in time for these calls if one does several in line.
The iteration time, however, will flatten out.
So in short; The total system time used is highest on which ever get alloc'ed first. This is however an overhead that can't be escaped in the confinement of a process.
There is a lot of maintenance going on. A quick touch on some of the cases:
When a process requests memory it is served a virtual address range. This range is translated by a page table to physical memory. If pages were translated byte by byte we would quickly get huge page tables. This is one reason why memory ranges are served in chunks, or pages. The page size is system dependent. The architecture can also provide various page sizes.
If we look at execution of above code and add some reads from /proc/PID/stat we see this in action (Esp. note RSS):
PID Stat {
PID : 4830 Process ID
MINFLT : 214 Minor faults, (no page memory read)
UTIME : 0 Time user mode
STIME : 0 Time kernel mode
VSIZE : 2039808 Virtual memory size, bytes
RSS : 73 Resident Set Size, Number of pages in real memory
} : Init
PID Stat {
PID : 4830 Process ID
MINFLT : 51504 Minor faults, (no page memory read)
UTIME : 4 Time user mode
STIME : 33 Time kernel mode
VSIZE : 212135936 Virtual memory size, bytes
RSS : 51420 Resident Set Size, Number of pages in real memory
} : Post calloc arr1
PID Stat {
PID : 4830 Process ID
MINFLT : 51515 Minor faults, (no page memory read)
UTIME : 4 Time user mode
STIME : 33 Time kernel mode
VSIZE : 422092800 Virtual memory size, bytes
RSS : 51428 Resident Set Size, Number of pages in real memory
} : Post calloc arr2
PID Stat {
PID : 4830 Process ID
MINFLT : 51516 Minor faults, (no page memory read)
UTIME : 36 Time user mode
STIME : 33 Time kernel mode
VSIZE : 422092800 Virtual memory size, bytes
RSS : 51431 Resident Set Size, Number of pages in real memory
} : Post iteration arr1
PID Stat {
PID : 4830 Process ID
MINFLT : 102775 Minor faults, (no page memory read)
UTIME : 68 Time user mode
STIME : 58 Time kernel mode
VSIZE : 422092800 Virtual memory size, bytes
RSS : 102646 Resident Set Size, Number of pages in real memory
} : Post iteration arr2
PID Stat {
PID : 4830 Process ID
MINFLT : 102776 Minor faults, (no page memory read)
UTIME : 68 Time user mode
STIME : 69 Time kernel mode
VSIZE : 2179072 Virtual memory size, bytes
RSS : 171 Resident Set Size, Number of pages in real memory
} : Post free()
As we can see, actually allocating pages in memory is postponed for arr2, awaiting a page request; this lasts until iteration begins. If we add a malloc(0) before the calloc of arr1, we can register that neither array is allocated in physical memory before iteration.
As a page might never be used, it is more efficient to do the mapping on request. This is why, when the process does e.g. a calloc, a sufficient number of pages is reserved, but not necessarily actually allocated in real memory.
When an address is referenced the page table is consulted. If the address is in a page which is not allocated the system serves a page fault and the page is subsequently allocated. Total sum of allocated pages is called Resident Set Size (RSS).
We can do an experiment with our array by iterating (touching) e.g. 1/4 of it. Here I have also added a malloc(0) before any calloc.
Pre iteration 1/4:
RSS : 171 Resident Set Size, Number of pages in real memory
for (i = 0; i < SIZE / 4; ++i)
arr1[i] = 0;
Post iteration 1/4:
RSS : 12967 Resident Set Size, Number of pages in real memory
Post iteration 1/1:
RSS : 51134 Resident Set Size, Number of pages in real memory
To further speed up things most systems additionally cache the N most recent page table entries in a translation lookaside buffer (TLB).
When a process (c|m|…)allocs, the upper bound of the heap is expanded by brk() or sbrk(). These system calls are expensive, and to compensate for this malloc collects multiple smaller calls into one bigger brk().
This also affects free(): since a negative brk() is also resource expensive, the frees are collected and performed as one bigger operation.
For huge requests, like the one in your code, malloc() uses mmap(). The threshold for this, which is configurable by mallopt(), is an educated value.
We can have fun with this by modifying SIZE in your code. If we utilize malloc.h and use
struct mallinfo minf = mallinfo();
(no, not milf), we can show this (note Arena and Hblkhd, …):
Initial:
mallinfo {
Arena : 0 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 0 (Number of chunks allocated with mmap)
Hblkhd : 0 (Bytes allocated with mmap)
Uordblks: 0 (Memory occupied by chunks handed out by malloc)
Fordblks: 0 (Memory occupied by free chunks)
Keepcost: 0 (Size of the top-most releasable chunk)
} : Initial
MAX = ((128 * 1024) / sizeof(int))
mallinfo {
Arena : 0 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 1 (Number of chunks allocated with mmap)
Hblkhd : 135168 (Bytes allocated with mmap)
Uordblks: 0 (Memory occupied by chunks handed out by malloc)
Fordblks: 0 (Memory occupied by free chunks)
Keepcost: 0 (Size of the top-most releasable chunk)
} : After malloc arr1
mallinfo {
Arena : 0 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 2 (Number of chunks allocated with mmap)
Hblkhd : 270336 (Bytes allocated with mmap)
Uordblks: 0 (Memory occupied by chunks handed out by malloc)
Fordblks: 0 (Memory occupied by free chunks)
Keepcost: 0 (Size of the top-most releasable chunk)
} : After malloc arr2
Then we subtract sizeof(int) from MAX and get:
mallinfo {
Arena : 266240 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 0 (Number of chunks allocated with mmap)
Hblkhd : 0 (Bytes allocated with mmap)
Uordblks: 131064 (Memory occupied by chunks handed out by malloc)
Fordblks: 135176 (Memory occupied by free chunks)
Keepcost: 135176 (Size of the top-most releasable chunk)
} : After malloc arr1
mallinfo {
Arena : 266240 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 0 (Number of chunks allocated with mmap)
Hblkhd : 0 (Bytes allocated with mmap)
Uordblks: 262128 (Memory occupied by chunks handed out by malloc)
Fordblks: 4112 (Memory occupied by free chunks)
Keepcost: 4112 (Size of the top-most releasable chunk)
} : After malloc arr2
We register that the system works as advertised. If the size of the allocation is below the threshold, sbrk is used and the memory is handled internally by malloc; else mmap is used.
The structure of this also helps prevent fragmentation of memory, etc.
Point being that the malloc family is optimized for general usage. However, the mmap limits can be modified to meet special needs.
Note this (and down through 100+ lines) when / if modifying the mmap threshold.
This can be further observed if we fill (touch) every page of arr1 and arr2 before we do the timing:
Touch pages … (Here with page size of 4 kB)
for (i = 0; i < SIZE; i += 4096 / sizeof(int)) {
arr1[i] = 0;
arr2[i] = 0;
}
Itr arr1 : 0.312462317
CPU arr1 : 0.32
Itr arr2 : 0.312869158
CPU arr2 : 0.31
So, the CPU knows the physical address then? Nah.
In the world of memory a lot has to be addressed ;). A core piece of hardware for this is the memory management unit (MMU), either as an integrated part of the CPU or as an external chip.
The operating system configures the MMU on boot and defines access for various regions (read only, read-write, etc.), thus giving a level of security.
The address we as mortals see is the logical address that the CPU uses. The MMU translates this to a physical address.
The CPU's address consists of two parts: a page address and an offset. [PAGE_ADDRESS.OFFSET]
For the process of getting a physical address we can have something like:
.-----. .--------------.
| CPU > --- Request page 2 ----> | MMU |
+-----+ | Pg 2 == Pg 4 |
| +------v-------+
+--Request offset 1 -+ |
| (Logical page 2 EQ Physical page 4)
[ ... ] __ | |
[ OFFSET 0 ] | | |
[ OFFSET 1 ] | | |
[ OFFSET 2 ] | | |
[ OFFSET 3 ] +--- Page 3 | |
[ OFFSET 4 ] | | |
[ OFFSET 5 ] | | |
[ OFFSET 6 ]__| ___________|____________+
[ OFFSET 0 ] | |
[ OFFSET 1 ] | ...........+
[ OFFSET 2 ] |
[ OFFSET 3 ] +--- Page 4
[ OFFSET 4 ] |
[ OFFSET 5 ] |
[ OFFSET 6 ]__|
[ ... ]
A CPU's logical address space is directly linked to the address length. A 32-bit address processor has a logical address space of 2^32 bytes. The physical address space is how much memory the system can afford.
There is also the handling of fragmented memory, re-alignment etc.
This brings us into the world of swap files. If a process requests more memory than is physically available, one or several pages of other process(es) are transferred to disk/swap and their pages "stolen" by the requesting process. The MMU keeps track of this; thus the CPU doesn't have to worry about where the memory is actually located.
This further brings us on to dirty memory.
If we print some information from /proc/[pid]/smaps, more specifically the range of our arrays, we get something like:
Start:
b76f3000-b76f5000
Private_Dirty: 8 kB
Post calloc arr1:
aaeb8000-b76f5000
Private_Dirty: 12 kB
Post calloc arr2:
9e67c000-b76f5000
Private_Dirty: 20 kB
Post iterate 1/4 arr1
9e67b000-b76f5000
Private_Dirty: 51280 kB
Post iterate arr1:
9e67a000-b76f5000
Private_Dirty: 205060 kB
Post iterate arr2:
9e679000-b76f5000
Private_Dirty: 410096 kB
Post free:
9e679000-9e67d000
Private_Dirty: 16 kB
b76f2000-b76f5000
Private_Dirty: 12 kB
When a virtual page is created, the system typically clears a dirty bit in the page. When the CPU writes to a part of this page the dirty bit is set; thus when swapping, pages with the dirty bit set are written out, and clean pages are skipped.