I'm trying to understand the timings of malloc and free. So I wrote this simple program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>   /* gettimeofday, struct timeval */

int
main(void)
{
    long i;
    for (i = 2; i < 10000000000L; i *= 2) {
        struct timeval start, end;
        double timing;
        long j;

        gettimeofday(&start, NULL);
        double *vect = malloc((size_t) i * sizeof(*vect));
        if (!vect) {
            printf("malloc failed\n");
            exit(-1);
        }
        gettimeofday(&end, NULL);
        timing = (double) (end.tv_sec * 1e6 + end.tv_usec) - (start.tv_sec * 1e6 + start.tv_usec);
        printf("size %ld allocating (%f)\t", i * sizeof(*vect), timing);

        /* Touch every element to avoid lazy allocation. */
        for (j = 0; j < i; j++)
            vect[j] = 2;

        gettimeofday(&start, NULL);
        free(vect);
        gettimeofday(&end, NULL);
        timing = (double) (end.tv_sec * 1e6 + end.tv_usec) - (start.tv_sec * 1e6 + start.tv_usec);
        printf("deallocating (%f)\n", timing);
    }
    return 0;
}
This is what I get as output:
size 16 allocating (40.000000) deallocating (0.000000)
size 32 allocating (0.000000) deallocating (0.000000)
size 64 allocating (0.000000) deallocating (0.000000)
size 128 allocating (0.000000) deallocating (1.000000)
size 256 allocating (0.000000) deallocating (0.000000)
size 512 allocating (0.000000) deallocating (0.000000)
size 1024 allocating (1.000000) deallocating (0.000000)
size 2048 allocating (0.000000) deallocating (0.000000)
size 4096 allocating (1.000000) deallocating (0.000000)
size 8192 allocating (1.000000) deallocating (0.000000)
size 16384 allocating (1.000000) deallocating (0.000000)
size 32768 allocating (1.000000) deallocating (1.000000)
size 65536 allocating (1.000000) deallocating (0.000000)
size 131072 allocating (1.000000) deallocating (1.000000)
size 262144 allocating (2.000000) deallocating (4.000000)
size 524288 allocating (2.000000) deallocating (2.000000)
size 1048576 allocating (1.000000) deallocating (2.000000)
size 2097152 allocating (3.000000) deallocating (3.000000)
size 4194304 allocating (2.000000) deallocating (4.000000)
size 8388608 allocating (4.000000) deallocating (3.000000)
size 16777216 allocating (2.000000) deallocating (3.000000)
size 33554432 allocating (3.000000) deallocating (2.000000)
size 67108864 allocating (2.000000) deallocating (7.000000)
size 134217728 allocating (7.000000) deallocating (8.000000)
size 268435456 allocating (6.000000) deallocating (8.000000)
size 536870912 allocating (5.000000) deallocating (10.000000)
size 1073741824 allocating (6.000000) deallocating (12.000000)
size 2147483648 allocating (25.000000) deallocating (13.000000)
size 4294967296 allocating (7.000000) deallocating (11.000000)
size 8589934592 allocating (6.000000) deallocating (11.000000)
I'm really surprised at how cheap malloc stays as the size of the vector increases. Shouldn't the cost grow more drastically with the size?

My second question is about the free function. I always thought malloc was the expensive one, not free. Yet here free is more expensive, which does not make sense to me.

I have some knowledge of how the system handles memory (physical pages and virtual pages), but these results don't make sense to me. malloc is not that expensive after all... or is it? :)

Any comments are welcome!
edit: Thanks a lot for all the fast comments, much appreciated! I took the comments into account and changed my program a bit. Instead of malloc I now use calloc. Additionally, I now pass through the array twice and time both passes: the first pass makes sure all the pages are actually allocated, and the second one measures the cost of just accessing the array. There is clearly a difference, and it increases with the size of the array!
I'm trying to get performance results for my algorithm, so I would like to get rid of this additional cost. Most of the memory used in my algorithm is allocated at the beginning. Is there any way to tell malloc to allocate the memory and actually commit it up front? The goal is to have more reproducible (and better :) ) results.
size 262144 allocating (5.000000) first pass (166.000000) second pass (190.000000) diff between passes (24.000000) deallocating (10.000000)
size 524288 allocating (4.000000) first pass (330.000000) second pass (328.000000) diff between passes (2.000000) deallocating (3.000000)
size 1048576 allocating (2.000000) first pass (669.000000) second pass (673.000000) diff between passes (4.000000) deallocating (5.000000)
size 2097152 allocating (5.000000) first pass (1326.000000) second pass (1314.000000) diff between passes (12.000000) deallocating (6.000000)
size 4194304 allocating (4.000000) first pass (2655.000000) second pass (2586.000000) diff between passes (69.000000) deallocating (5.000000)
size 8388608 allocating (4.000000) first pass (4858.000000) second pass (4838.000000) diff between passes (20.000000) deallocating (5.000000)
size 16777216 allocating (3.000000) first pass (9034.000000) second pass (8458.000000) diff between passes (576.000000) deallocating (4.000000)
size 33554432 allocating (3.000000) first pass (15702.000000) second pass (14375.000000) diff between passes (1327.000000) deallocating (4.000000)
size 67108864 allocating (4.000000) first pass (25785.000000) second pass (23228.000000) diff between passes (2557.000000) deallocating (3.000000)
Most of this is very implementation dependent, but let us look at how a typical malloc implementation works.
In your case, when the allocation size is small, the allocator tries to avoid internal fragmentation and makes very compact allocations. This means some pages are already mapped and future allocations are served from those same pages. You can see a hint of this in how your first allocation takes a bit more time than the others (the allocator also initializes its heap data structures there).
In any case, each of these small-chunk allocations costs about the same, since it is just a few pointer updates. The underlying allocation through sbrk or mmap has already been done (until the allocator needs more pages).
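For example, here is a minimal sketch (not from the original question) that prints the addresses of successive small allocations; on a typical allocator they land close together inside the same already-mapped region, with no system call per malloc:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    for (int k = 0; k < 8; k++) {
        void *p = malloc(32);
        if (!p) {
            perror("malloc");
            return EXIT_FAILURE;
        }
        /* Successive small chunks are carved out of the same
         * already-mapped heap pages, so the addresses are close. */
        printf("allocation %d at %p\n", k, p);
        /* Intentionally not freed, so each call takes a new chunk. */
    }
    return 0;
}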
Now to the bigger allocations. For these, many allocators simply fall back to mapping new pages (after page-aligning the size). This requires new pages to be mapped, which is a system call.
Page mapping is done at page granularity, which means the work required is on the order of the number of pages (again, this is very operating-system dependent; on some systems mapping one page might be as costly as mapping 10000 pages).
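On glibc you can even control where this fallback kicks in: the mallopt parameter M_MMAP_THRESHOLD sets the size above which malloc switches from the heap to a per-request mmap. A glibc-specific sketch:

#include <malloc.h>   /* mallopt, M_MMAP_THRESHOLD (glibc) */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Serve requests up to 64 MiB from the heap instead of mmap
     * (by default the threshold starts at 128 KiB and adapts).
     * mallopt returns 0 on failure. */
    if (mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024) == 0)
        fprintf(stderr, "mallopt failed\n");

    double *v = malloc(1024 * 1024 * sizeof *v); /* now heap, not mmap */
    free(v);
    return 0;
}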
As mentioned by @Ctx, most modern operating systems do not even update the page tables at the time of the sbrk or mmap call; they do it when the data is actually read or written on the pages. So the commit might be just some internal data structures being updated in the kernel.
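This lazy commit is also what your edited benchmark runs into. If you want the pages actually backed by physical memory up front (the "allocate and commit" your edit asks about), one portable trick is to touch one byte per page right after allocating; on Linux, mmap with MAP_POPULATE or mlock are alternatives. A sketch, assuming a POSIX system (the helper name is mine):

#include <stdlib.h>
#include <unistd.h>

/* Allocate n doubles and fault every page in immediately, so later
 * accesses do not pay per-page page-fault costs. */
double *alloc_prefaulted(size_t n)
{
    double *v = malloc(n * sizeof *v);
    if (!v)
        return NULL;

    /* Touch one byte per page; volatile keeps the compiler from
     * optimizing the writes away. */
    size_t page = (size_t) sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < n * sizeof *v; off += page)
        ((volatile char *)v)[off] = 0;

    return v;
}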
Coming to free: it is generally very cheap for small allocations, since it only involves returning the chunk to a free list and performing one or two merges. The heap does not necessarily return the pages to the OS in this case.
For big allocations the story is similar to allocation: a system call is made to de-commit the pages, and the operation may or may not be proportional to the number of pages.
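On glibc specifically, you can explicitly ask the allocator to hand free heap pages back to the kernel with malloc_trim; whether plain free does so is up to the implementation. A small sketch:

#include <malloc.h>   /* malloc_trim (glibc) */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *p = malloc(32 * 1024 * 1024);
    free(p);

    /* Argument 0 = keep no padding at the top of the heap;
     * returns 1 if any memory was actually released to the OS. */
    if (malloc_trim(0))
        printf("released free heap pages back to the OS\n");
    return 0;
}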
Another factor that might affect your timings is the default behavior of your allocator. malloc is not required to clear the memory before returning it, but some allocators do (basically they behave like calloc). In that case the cost of malloc may grow linearly with the size. It might be worthwhile in your case to run a similar comparison for calloc with different sizes, as sketched below.
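A minimal sketch of that comparison, reusing the gettimeofday timing from the question (the sizes and step are arbitrary choices of mine, not from the original post):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double elapsed_us(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec);
}

int main(void)
{
    struct timeval t0, t1;

    for (size_t n = (size_t)1 << 10; n <= (size_t)1 << 26; n <<= 2) {
        gettimeofday(&t0, NULL);
        double *m = malloc(n * sizeof *m);
        gettimeofday(&t1, NULL);
        double tm = elapsed_us(t0, t1);

        gettimeofday(&t0, NULL);
        double *c = calloc(n, sizeof *c);   /* zeroed memory */
        gettimeofday(&t1, NULL);
        double tc = elapsed_us(t0, t1);

        printf("n=%zu malloc %.0f us, calloc %.0f us\n", n, tm, tc);
        free(m);
        free(c);
    }
    return 0;
}

If malloc stays near free-of-charge while calloc grows with the size, your allocator does not clear memory in malloc; if both grow the same way, it effectively behaves like calloc.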