I'm relatively familiar with how virtual memory works. All process memory is divided into pages, and every page of virtual memory maps to a page in real memory, or to a page in the swap file, or it can be a new page, which means the physical page is not allocated yet. The OS maps new pages to real memory on demand: not when an application asks for memory with malloc, but only when the application actually accesses each page of the allocated memory. But I still have questions.
I noticed this while profiling my app with the Linux perf tool. About 20% of the time is taken by the kernel functions clear_page_orig, __do_page_fault and get_page_from_free_list. This was much more than I expected for this task, so I did some research.
Let's start with a small example:
```c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define SIZE 1 * 1024 * 1024

int main(int argc, char *argv[]) {
  int i;
  int sum = 0;
  int *p = (int *) malloc(SIZE);

  for (i = 0; i < 10000; i++) {
    memset(p, 0, SIZE);
    sum += p[512];
  }

  free(p);
  printf("sum %d\n", sum);
  return 0;
}
```
Let's assume that memset is just some memory-bound processing. In this case, we allocate a small chunk of memory once and reuse it again and again. I'll run this program like this:
$ gcc -O1 ./mem.c && time ./a.out
-O1 is required because clang with -O2 entirely eliminates the loop and calculates the value instantly.
The results are: user: 0.520s, sys: 0.008s. According to perf, 99% of this time is spent in memset from libc. So in this case the write throughput is 10000 × 1 MB / 0.52 s ≈ 20 GB/s, which is more than the theoretical 12.5 GB/s bandwidth of my memory. It looks like this is due to the L3 CPU cache.
Let's change the test and start allocating memory in a loop (I won't repeat the same parts of the code):
```c
#define SIZE 1 * 1024 * 1024

for (i = 0; i < 10000; i++) {
  int *p = (int *) malloc(SIZE);
  memset(p, 0, SIZE);
  free(p);
}
```
The result is exactly the same. I believe that free doesn't actually release memory to the OS; it just puts it on some free list within the process, and malloc on the next iteration gets exactly the same memory block. That is why there is no noticeable difference.
Let's start increasing SIZE from 1 Megabyte. Execution time grows little by little and saturates near 10 Megabytes (for me there is no difference between 10 and 20 Megabytes).
```c
#define SIZE 10 * 1024 * 1024

for (i = 0; i < 1000; i++) {
  int *p = (int *) malloc(SIZE);
  memset(p, 0, SIZE);
  free(p);
}
```
Time shows: user: 1.184s, sys: 0.004s. perf still reports that 99% of the time is in memset, but the throughput is now about 8.3 GB/s. At this point, I understand what is going on, more or less.
If we continue to increase the memory block size, at some point (for me at 35 MB) execution time increases dramatically: user: 0.724s, sys: 3.300s.
```c
#define SIZE 40 * 1024 * 1024

for (i = 0; i < 250; i++) {
  int *p = (int *) malloc(SIZE);
  memset(p, 0, SIZE);
  free(p);
}
```
According to perf, memset now consumes only 18% of the time.
Obviously, memory is allocated from the OS and freed back to it on each iteration. As I mentioned before, the OS must clear each allocated page before use. So the 27.3% in clear_page_orig doesn't look extraordinary: it is just 4 s × 0.273 ≈ 1.1 s to clear the memory, the same as we got in the third example. memset took 17.9%, which gives ≈ 700 ms; that is normal because the memory is already in the L3 cache after clear_page_orig (as in the first and second examples).
What I can't understand is why the last case is two times slower than one memset pass through memory plus one memset pass through the L3 cache. Can I do something about it?
The results are reproducible (with small differences) on native macOS, on Ubuntu under VMware, and on an Amazon c4.large instance.
Also, I think there is room for optimization on two levels.
What's happening here is a bit complicated as it involves a few different systems, but it is definitely not related to the context switch cost; your program makes very few system calls (verify this by using strace).
First it's important to understand some basic principles about the way malloc
implementations generally work:
- malloc implementations obtain a bunch of memory from the OS by calling sbrk or mmap during initialization. The amount of memory obtained can be adjusted in some malloc implementations. Once the memory is obtained, it is typically cut into different size classes and arranged in a data structure so that when a program requests memory with e.g., malloc(123), the malloc implementation can quickly find a piece of memory matching those requirements.
- When you call free, memory is returned to a free list and can be re-used on subsequent calls to malloc. Some malloc implementations allow you to tune precisely how this works.
- Many malloc implementations will simply pass calls for huge amounts of memory straight to the mmap system call, which allocates "pages" of memory at a time. For most systems, 1 page of memory is 4096 bytes.
- The kernel must zero out pages of memory before handing them to a process, regardless of whether they were obtained via mmap or sbrk. This is why you see calls to clear_page_orig in the perf output: this function writes 0s to pages of memory.

Now, these principles intersect with another idea which has many names but is commonly referred to as "demand paging." What "demand paging" means is that when a user program requests a chunk of memory from the OS (say by calling mmap), the memory is allocated in the virtual address space of the process, but there is no physical RAM backing that memory yet.
Here's an outline of the demand paging process:

1. The program calls mmap to allocate 500MB of RAM.
2. The kernel records the mapping in the process's virtual address space, but attaches no physical pages yet.
3. When the program first touches a page, a page fault is raised (this is __do_page_fault in your perf output).
4. The kernel takes a physical page off its free lists (get_page_from_free_list), zeroes it (clear_page_orig), and wires it into the process's page tables.
5. The faulting instruction is restarted and execution continues.

The most likely reason why you are seeing a performance degradation in the last case is that:
- Each 40 MB request is big enough that malloc passes it straight to mmap, and each free returns it to the OS with munmap, so every iteration pays the full page-fault and page-zeroing cost all over again.
- memset is O(n), meaning the more memory you need to write to, the longer it will take.

If your application is extremely performance sensitive, you can instead call mmap directly and:
- Pass the MAP_POPULATE flag, which will cause all the page faults to happen up-front and map all the physical memory in; then you won't be paying the cost of page faults on access.
- Pass the MAP_UNINITIALIZED flag, which will attempt to avoid zeroing pages of memory prior to distributing them to your process. Note that using this flag is a security concern and should not be used unless you fully understand the implications of this option. It is possible that your process could be issued pages of memory that were used by other, unrelated processes for storing sensitive information. Also note that your kernel must be compiled to allow this option. Most kernels (like the AWS Linux kernel) do not come with this option enabled by default. You should almost certainly not use this option.

I would caution you that this level of optimization is almost always a mistake; most applications have much lower hanging fruit for optimization that does not involve optimizing the page fault cost. In a real world application, I'd recommend:

- Avoiding memset on large blocks of memory unless it is truly necessary. Most of the time, zeroing memory prior to re-use by the same process is not necessary.
- Using the MAP_POPULATE flag above only if the cost of the page faults on access is truly detrimental to performance (unlikely).

Please leave comments if you have any questions and I'll be happy to edit this post and expand on it a bit if needed.
I'm not certain, but I'm willing to bet the cost of context switching from user mode to kernel, and back again, is dominating everything else. memset
also takes significant time -- remember it's going to be O(n).
Update
> I believe that free doesn't actually free memory for OS, it just put it in some free list within the process. And malloc on next iteration just get exactly the same memory block. That is why there is no noticeable difference.
This is, in principle, correct. The classic malloc implementation keeps allocations on a singly-linked list; free simply sets a flag saying the allocation is no longer in use. On later calls, malloc hands out the first free block it finds that is big enough. This works well enough, but can lead to fragmentation.
There are a number of slicker implementations now, see this Wikipedia article.