
Linux overcommit heuristic

The overcommit article in the kernel documentation only mentions that overcommit mode 0 is based on heuristic overcommit handling. It does not outline the heuristic involved.

Could someone shed light on what the actual heuristic is? Any relevant link to the kernel sources works too!

KodeWarrior asked Jul 31 '16 22:07


1 Answer

Actually, the kernel documentation of overcommit accounting has some details: https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

The Linux kernel supports the following overcommit handling modes

0 - Heuristic overcommit handling.

Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.

Also, from Documentation/sysctl/vm.txt:

overcommit_memory: This value contains a flag that enables memory overcommitment.
When this flag is 0, the kernel attempts to estimate the amount of free memory left when userspace requests more memory...

See Documentation/vm/overcommit-accounting and mm/mmap.c::__vm_enough_memory() for more information.

Also, man 5 proc:

/proc/sys/vm/overcommit_memory This file contains the kernel virtual memory accounting mode. Values are:

                0: heuristic overcommit (this is the default)
                1: always overcommit, never check
                2: always check, never overcommit

In mode 0, calls of mmap(2) with MAP_NORESERVE are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed".

So very large allocations are refused by the heuristic, but an application may still allocate more virtual memory than the system has physical memory, as long as it does not actually use all of it. With MAP_NORESERVE, the amount of memory that can be mapped may be even higher.
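A minimal userspace sketch of the point above: ask for address space without touching it and see whether the kernel accepts the request. try_reserve() is a hypothetical helper name, not a kernel or libc API, and the outcome for large sizes depends on the machine and on the current overcommit mode.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS and MAP_NORESERVE in glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: request `len` bytes of private anonymous address
 * space, never write to it, and report whether mmap() accepted the request.
 * Returns 1 on success, 0 if mmap() failed (for example with ENOMEM).
 */
int try_reserve(size_t len, int noreserve)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *p;

    assert(len > 0);
    if (noreserve)
        flags |= MAP_NORESERVE;   /* per proc(5), not checked in mode 0 */

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return 0;

    /* Nothing was written, so no physical pages were consumed. */
    munmap(p, len);
    return 1;
}
```

Calling try_reserve() with a size well above RAM plus swap, once with noreserve set to 0 and once with 1, is one way to observe the difference on a given system; no particular result is guaranteed.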

The documentation says "The overcommit policy is set via the sysctl `vm.overcommit_memory'", so we can look up how it is implemented in the source code: http://lxr.free-electrons.com/ident?v=4.4;i=sysctl_overcommit_memory. The variable is defined at line 112 of mm/mmap.c:

  112 int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;  /* heuristic overcommit */
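From userspace, the same setting is visible in /proc/sys/vm/overcommit_memory. The sketch below reads and decodes it; the three function names are hypothetical helpers, and the mapping of 0, 1, 2 to OVERCOMMIT_GUESS, OVERCOMMIT_ALWAYS and OVERCOMMIT_NEVER follows the constants used in the kernel source quoted in this answer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse the text of the sysctl file.
 * Returns 0, 1 or 2, or -1 if the text is not a valid mode. */
int parse_overcommit_mode(const char *text)
{
    char *end;
    long v;

    assert(text != NULL);
    v = strtol(text, &end, 10);
    if (end == text || v < 0 || v > 2)
        return -1;
    return (int)v;
}

/* Hypothetical helper: read the current mode, or -1 if unavailable. */
int read_overcommit_mode(void)
{
    char buf[16];
    int mode = -1;
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) != NULL)
        mode = parse_overcommit_mode(buf);
    fclose(f);
    return mode;
}

/* Map a mode number to the name of the kernel constant. */
const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "OVERCOMMIT_GUESS";
    case 1:  return "OVERCOMMIT_ALWAYS";
    case 2:  return "OVERCOMMIT_NEVER";
    default: return "unknown";
    }
}
```

On a default system, overcommit_mode_name(read_overcommit_mode()) would be expected to name the heuristic mode, since 0 is the documented default.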

The constant OVERCOMMIT_GUESS (defined in linux/mman.h) is actually used only at line 170 of mm/mmap.c, and this is the implementation of the heuristic:

138 /*
139  * Check that a process has enough memory to allocate a new virtual
140  * mapping. 0 means there is enough memory for the allocation to
141  * succeed and -ENOMEM implies there is not.
142  *
143  * We currently support three overcommit policies, which are set via the
144  * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
145  *
146  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
147  * Additional code 2002 Jul 20 by Robert Love.
148  *
149  * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
150  *
151  * Note this is a helper function intended to be used by LSMs which
152  * wish to use this logic.
153  */
154 int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
...
170         if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
171                 free = global_page_state(NR_FREE_PAGES);
172                 free += global_page_state(NR_FILE_PAGES);
173 
174                 /*
175                  * shmem pages shouldn't be counted as free in this
176                  * case, they can't be purged, only swapped out, and
177                  * that won't affect the overall amount of available
178                  * memory in the system.
179                  */
180                 free -= global_page_state(NR_SHMEM);
181 
182                 free += get_nr_swap_pages();
183 
184                 /*
185                  * Any slabs which are created with the
186                  * SLAB_RECLAIM_ACCOUNT flag claim to have contents
187                  * which are reclaimable, under pressure.  The dentry
188                  * cache and most inode caches should fall into this
189                  */
190                 free += global_page_state(NR_SLAB_RECLAIMABLE);
191 
192                 /*
193                  * Leave reserved pages. The pages are not for anonymous pages.
194                  */
195                 if (free <= totalreserve_pages)
196                         goto error;
197                 else
198                         free -= totalreserve_pages;
199 
200                 /*
201                  * Reserve some for root
202                  */
203                 if (!cap_sys_admin)
204                         free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
205 
206                 if (free > pages)
207                         return 0;
208 
209                 goto error;
210         }

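So the heuristic estimates how many pages could be made available right now (free: free pages, page cache, free swap and reclaimable slab, minus shmem and the reserves) and compares that estimate with the size of the request (the application asks for pages pages). The request succeeds only if free > pages.

With overcommit always enabled (mode "1"), this function always returns 0 ("there is enough memory for this request"):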
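The arithmetic of the mode 0 branch can be modelled in userspace as follows. This is only a sketch: struct vm_snapshot and heuristic_allows() are hypothetical names, all values are in pages, and the real kernel reads live counters rather than a snapshot.

```c
#include <assert.h>

/* Hypothetical snapshot of the counters the kernel consults (in pages). */
struct vm_snapshot {
    long free_pages;        /* NR_FREE_PAGES */
    long file_pages;        /* NR_FILE_PAGES (page cache) */
    long shmem_pages;       /* NR_SHMEM, cannot be purged */
    long swap_free_pages;   /* get_nr_swap_pages() */
    long slab_reclaimable;  /* NR_SLAB_RECLAIMABLE */
    long total_reserve;     /* totalreserve_pages */
    long admin_reserve;     /* sysctl_admin_reserve_kbytes, in pages */
};

/* Returns 1 if the request would pass, 0 if it would fail with -ENOMEM. */
int heuristic_allows(struct vm_snapshot s, long pages, int cap_sys_admin)
{
    long free_est;

    assert(pages >= 0);
    free_est = s.free_pages + s.file_pages - s.shmem_pages
             + s.swap_free_pages + s.slab_reclaimable;

    /* Reserved pages are not available for anonymous memory. */
    if (free_est <= s.total_reserve)
        return 0;
    free_est -= s.total_reserve;

    /* Ordinary users also leave the admin reserve untouched. */
    if (!cap_sys_admin)
        free_est -= s.admin_reserve;

    return free_est > pages;
}
```

With made-up numbers (1000 free, 500 file, 100 shmem, 400 swap, 200 slab, 300 reserved, 50 admin reserve) the estimate is 1650 pages for an ordinary user and 1700 for root, so a 1650-page request is refused for the former and granted for the latter.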

164         /*
165          * Sometimes we want to use more memory than we have
166          */
167         if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
168                 return 0;

In mode "2" the default heuristic is not used. The requested pages have already been accounted at the top of the function (line 162, before the mode checks), which gives the new Committed_AS (shown in /proc/meminfo):

162         vm_acct_memory(pages);
...

This is actually just an increment of vm_committed_as: __percpu_counter_add(&vm_committed_as, pages, vm_committed_as_batch). The new total is then compared with a limit:
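The counter can be watched from userspace through the Committed_AS line of /proc/meminfo. The sketch below extracts one field from meminfo-style text; meminfo_field_kb() and read_meminfo_kb() are hypothetical helper names, and the 8 KiB buffer size is an assumption about how large the file is.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find a line of the form "Key:   12345 kB" and
 * return the number in kB, or -1 if the key is missing or malformed. */
long meminfo_field_kb(const char *text, const char *key)
{
    const char *line = text;
    size_t klen;

    assert(text != NULL && key != NULL);
    klen = strlen(key);

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;
            if (sscanf(line + klen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;   /* step past the newline to the next line */
    }
    return -1;
}

/* Hypothetical helper: look a key up in the live /proc/meminfo. */
long read_meminfo_kb(const char *key)
{
    static char buf[8192];   /* assumed large enough for the file */
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_field_kb(buf, key);
}
```

Comparing read_meminfo_kb("Committed_AS") before and after a large allocation is one way to see the accounting move, although other processes change the value at the same time.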

212         allowed = vm_commit_limit();

The limit comes from vm_commit_limit():

401 /*
402  * Committed memory limit enforced when OVERCOMMIT_NEVER policy is used
403  */
404 unsigned long vm_commit_limit(void)
405 {
406         unsigned long allowed;
407 
408         if (sysctl_overcommit_kbytes)
409                 allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10);
410         else
411                 allowed = ((totalram_pages - hugetlb_total_pages())
412                            * sysctl_overcommit_ratio / 100);
413         allowed += total_swap_pages;
414 
415         return allowed;
416 }
417 

So allowed is either the value of the vm.overcommit_kbytes sysctl (in kilobytes, converted to pages) or vm.overcommit_ratio percent of physical RAM (excluding huge pages); in both cases the total swap size is added.
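As a worked example of this formula, here is a userspace sketch that takes the inputs as plain parameters. commit_limit_pages() is a hypothetical name, and the RAM and swap sizes in the note below are made up for illustration.

```c
#include <assert.h>

/* Userspace model of the limit calculation; all sizes are in pages
 * except overcommit_kbytes, which is in kilobytes as in the sysctl. */
unsigned long commit_limit_pages(unsigned long totalram_pages,
                                 unsigned long hugetlb_pages,
                                 unsigned long total_swap_pages,
                                 unsigned long overcommit_kbytes,
                                 unsigned long overcommit_ratio,
                                 unsigned int page_shift)
{
    unsigned long allowed;

    assert(page_shift >= 10);
    assert(hugetlb_pages <= totalram_pages);

    if (overcommit_kbytes)
        /* Absolute limit: convert kilobytes to pages. */
        allowed = overcommit_kbytes >> (page_shift - 10);
    else
        /* Relative limit: a percentage of RAM without huge pages. */
        allowed = (totalram_pages - hugetlb_pages) * overcommit_ratio / 100;

    /* Swap is added in both cases. */
    return allowed + total_swap_pages;
}
```

For a machine with 8 GiB of RAM (2097152 pages of 4 KiB), 2 GiB of swap (524288 pages), no huge pages and a ratio of 50, the result is 1048576 + 524288 = 1572864 pages, that is 6 GiB.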

213         /*
214          * Reserve some for root
215          */
216         if (!cap_sys_admin)
217                 allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);

Some memory is kept for root only. (PAGE_SHIFT is 12 on systems with 4 KiB pages, so the shift by PAGE_SHIFT - 10 is just a conversion from kilobytes to a page count.)
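The shift can be checked with a couple of numbers. kbytes_to_pages() is a hypothetical helper, and the reserve sizes used in the note are example values, not the defaults of any particular system.

```c
#include <assert.h>

/* A page is 2^page_shift bytes and a kilobyte is 2^10 bytes, so one page
 * holds 2^(page_shift - 10) kilobytes; dividing by that is a right shift. */
unsigned long kbytes_to_pages(unsigned long kbytes, unsigned int page_shift)
{
    assert(page_shift >= 10);
    return kbytes >> (page_shift - 10);
}
```

With 4 KiB pages (shift 12), a reserve of 8192 kB is 8192 >> 2 = 2048 pages; with 64 KiB pages (shift 16) the same reserve is 8192 >> 6 = 128 pages.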

218 
219         /*
220          * Don't let a single process grow so big a user can't recover
221          */
222         if (mm) {
223                 reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
224                 allowed -= min_t(long, mm->total_vm / 32, reserve);
225         }
226 
227         if (percpu_counter_read_positive(&vm_committed_as) < allowed)
228                 return 0;

If, after accounting for the request, the total amount of memory committed by userspace is still below allowed, the request is granted. Otherwise the request is denied and its accounting is undone:

229 error:
230         vm_unacct_memory(pages);
231 
232         return -ENOMEM;

In other words, as summarized in "The Linux kernel. Some remarks on the Linux Kernel", 2003-02-01 by Andries Brouwer, 9. Memory, 9.6 Overcommit and OOM - https://www.win.tue.nl/~aeb/linux/lk/lk-9.html:

Going in the right direction

Since 2.5.30 the values are:

  • 0 (default): as before: guess about how much overcommitment is reasonable,
  • 1: never refuse any malloc(),
  • 2: be precise about the overcommit - never commit a virtual address space larger than swap space plus a fraction overcommit_ratio of the physical memory.

So "2" is a precise calculation of the amount of memory committed after the request, and "0" is a heuristic estimate.

osgx answered Oct 18 '22 00:10