Given a process that creates a large Linux kernel page cache via mmap'd files, running that process in a Docker container (cgroup) with a memory limit causes kernel slab allocation errors:
Jul 18 21:29:01 ip-10-10-17-135 kernel: [186998.252395] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
Jul 18 21:29:01 ip-10-10-17-135 kernel: [186998.252402] cache: kmalloc-2048(2412:6c2c4ef2026a77599d279450517cb061545fa963ff9faab731daab2a1f672915), object size: 2048, buffer size: 2048, default order: 3, min order: 0
Jul 18 21:29:01 ip-10-10-17-135 kernel: [186998.252407] node 0: slabs: 135, objs: 1950, free: 64
Jul 18 21:29:01 ip-10-10-17-135 kernel: [186998.252409] node 1: slabs: 130, objs: 1716, free: 0
Watching slabtop, I can see that the number of buffer_head, radix_tree_node and kmalloc-* objects is heavily restricted in a container started with a memory limit. This has pathological consequences for IO throughput in the application, observable with iostat. It does not happen when running outside a container, or in a container with no memory limit, even when the page cache consumes all available memory on the host OS.
This appears to be an issue in the kernel memory accounting: the kernel page cache is not counted against the container's memory, but the SLAB objects that support it are. The behavior appears aberrant because when a large slab object pool is preallocated, the memory-constrained container works fine, freely reusing the existing slab space; only slab allocated inside the container counts against the container. No combination of container options for memory and kernel-memory seems to fix the issue (except not setting a memory limit at all, or a limit so large that it does not restrict the slab, but such a limit still restricts the addressable space). I have also tried, without success, to disable kmem accounting entirely by passing cgroup.memory=nokmem at boot.
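For reference, this is the shape of the option combinations I tried (the image name and sizes below are placeholders, not my exact values):

    # Per-container limits: user memory plus kernel memory (kmem) accounting.
    docker run -d --name pagecache-test --memory=8g --kernel-memory=8g my-image

    # Disabling kernel memory accounting globally: append cgroup.memory=nokmem
    # to GRUB_CMDLINE_LINUX in /etc/default/grub, then run update-grub and reboot.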
System Info:
To reproduce the issue you can use my PageCache Java code. This is a bare-bones repro case of an embedded database library that heavily leverages memory-mapped files and is deployed on a very fast file system. The application is deployed on AWS i3.baremetal instances via ECS. I am mapping a large volume from the host into the Docker container, where the memory-mapped files are stored. The AWS ECS agent requires setting a non-zero memory limit for all containers. The memory limit causes the pathological slab behavior, and the resulting application IO throughput is totally unacceptable.
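A sketch of how the container is started when reproducing this (the image name, paths and limit are placeholders; the ECS task definition sets the equivalent options):

    # Map the fast host filesystem that holds the mmap'd files into the
    # container and set a memory limit, as the ECS agent does for every task.
    docker run -it --memory=4g \
        -v /mnt/nvme0/data:/data \
        openjdk:10 bash
    # ...then run the PageCache repro against files under /data.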
It is helpful to drop_caches between runs using echo 3 > /proc/sys/vm/drop_caches. This will clear the page cache and the associated pool of slab objects.
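A typical between-runs reset and observation sequence (all standard tools; the exact invocations are illustrative):

    sync
    echo 3 > /proc/sys/vm/drop_caches   # drop page cache plus dentries/inodes, freeing the slab pool
    slabtop -o | head -n 20             # check buffer_head, radix_tree_node and kmalloc-* object counts
    iostat -xm 5                        # watch per-device IO throughput during the run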
Suggestions on how to fix or work around this issue, or even where to report it, would be welcome.
UPDATE: It appears that updating to Ubuntu 18.04 with the 4.15 kernel does fix the observed kmalloc allocation errors. The version of Java seems to be irrelevant. This appears to be because each v1 cgroup can only allocate page cache up to the memory limit (with multiple cgroups it is more complicated, with only one cgroup being "charged" for the allocation via the Shared Page Accounting scheme). I believe this is now consistent with the intended behavior. In the 4.4 kernel we found that the observed kmalloc errors were an intersection of using software raid0 in a v1 cgroup with a memory limit and a very large page cache. I believe the cgroups in the 4.4 kernel were able to map an unlimited number of pages (a bug which we found useful) up to the point at which the kernel ran out of memory for the associated slab objects, but I still don't have a smoking gun for the cause.
With the 4.15 kernel, our Docker containers are required to set a memory limit (via AWS ECS), so we have implemented a task that unsets the memory limit as soon as the container is created, by writing to /sys/fs/cgroup/memory/docker/{container_id}/memory.limit_in_bytes. This appears to work, though it is certainly not good practice. It allows the behavior we want: unlimited sharing of page cache resources on the host. Since we are running a JVM application with a fixed heap, the downside risk is limited.
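A minimal sketch of that post-start task, assuming cgroup v1 and the default Docker cgroup layout (the container name is a placeholder):

    # Look up the full container id and raise its memory limit to unlimited (-1).
    CONTAINER_ID=$(docker inspect --format '{{.Id}}' my-container)
    echo -1 > "/sys/fs/cgroup/memory/docker/${CONTAINER_ID}/memory.limit_in_bytes"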
For our use case, it would be fantastic to have the option to discount the page cache (mmap'd disk space) and the associated slab objects entirely for a cgroup, while maintaining the limit on heap and stack for the Docker process. The present Shared Page Accounting scheme is rather hard to reason about, and we would prefer to allow the LRU page cache (and associated SLAB resources) to use the full extent of the host's memory, as is the case when the memory limit is not set at all.
I have started following some conversations on LWN but I am a bit in the dark. Maybe this is a terrible idea? I don't know... advice on how to proceed or where to go next is welcome.
The --memory parameter limits the container memory usage, and Docker will kill the container if the container tries to use more than the limited memory.
Set Maximum Memory Access: to limit the maximum amount of memory usage for a container, add the --memory option to the docker run command. Alternatively, you can use the shortcut -m. Within the command, specify how much memory you want to dedicate to that specific container.
By default, if an out-of-memory (OOM) error occurs, the kernel kills processes in a container. To change this behavior, use the --oom-kill-disable option. Only disable the OOM killer on containers where you have also set the -m/--memory option.
By default, Docker does not apply memory limitations to individual containers. Containers can consume all available memory of the host.
If the kernel memory limit of the container is hit and more kernel memory is required, the Docker host (!, not the container) experiences a system-wide OOM (even though the host has more than enough free memory available). As a result, arbitrary applications and system services might be killed.
Docker provides ways to control how much memory, or CPU a container can use, setting runtime configuration flags of the docker run command. This section provides details on when you should set such limits and the possible implications of setting them.
To run a container with an additional 1 GB of swap memory, set the swap memory to 2 GB. The syntax for running a container with limited memory and additional swap memory is: sudo docker run -it --memory="[memory_limit]" --memory-swap="[memory_limit]" [docker_image]
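For example (values purely illustrative), a container limited to 1 GB of memory with 1 GB of additional swap, so --memory-swap is set to 2 GB:

    sudo docker run -it --memory=1g --memory-swap=2g ubuntu /bin/bash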
java 10.0.1 2018-04-17
You should try with a more recent version of java 10 (or 11 or...)
I mentioned in "Docker support in Java 8 — finally!" last May (2019) that new features from Java 10, backported to Java 8, mean the JVM now detects the memory available in Docker more accurately.
This article from May 2018 reports:
Success! Without providing any flags, Java 10 (10u46 -- Nightly) correctly detected Docker's memory limits.
The OP David confirms in the comments:
The docker - jvm integration is a big improvement in Java 10.
It is really to do with setting a sane Xms and the number of processors. These now respect the Docker container limits rather than picking up the host instance values (you can turn this feature off using -XX:-UseContainerSupport, depending on your use case). I have not found it helpful in dealing with the page cache, though.
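For illustration, the relevant flags look like this (values are examples only):

    # Let the JVM size itself from the container limits (the default in Java 10+):
    java -XX:MaxRAMPercentage=75.0 -jar app.jar
    # Opt out of container awareness and fall back to the host instance values:
    java -XX:-UseContainerSupport -jar app.jar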
The best solution I have found is to disable the Docker memory limit after the container is created, if need be.
This is definitely a hack - user beware.