
Cannot open /proc/self/oom_score_adj when I have the right capability

Tags:

c

linux

I'm trying to set the OOM killer score adjustment for a process, inspired by oom_adjust_setup in OpenSSH's port_linux.c. To do that, I open /proc/self/oom_score_adj, read the old value, and write a new value. Obviously, my process needs to be root or have the capability CAP_SYS_RESOURCE to do that.

I'm getting a result that I can't explain. When my process doesn't have the capability, I'm able to open that file and read and write values, though the value I write doesn't take effect (fair enough):

$ ./a.out 
CAP_SYS_RESOURCE: not effective, not permitted, not inheritable
oom_score_adj value: 0
wrote 5 bytes
oom_score_adj value: 0

But when my process does have the capability, I can't even open the file: it fails with EACCES:

$ sudo setcap CAP_SYS_RESOURCE+eip a.out
$ ./a.out 
CAP_SYS_RESOURCE: effective, permitted, not inheritable
failed to open /proc/self/oom_score_adj: Permission denied

Why does it do that? What am I missing?


Some further googling led me to this lkml post by Azat Khuzhin on 20 Oct 2013. Apparently CAP_SYS_RESOURCE lets you change oom_score_adj for any process but yourself. To change your own score adjustment, you need to combine it with CAP_DAC_OVERRIDE - that is, bypass permission checks on all files. (If I wanted that, I would have made this program setuid root.)

So my question is, how can I achieve this without CAP_DAC_OVERRIDE?


I'm running Ubuntu 16.04.4 (xenial), kernel version 4.13.0-45-generic. My problem is similar to, but different from, this question: that one is about an error on write when the process doesn't have the capability.

My sample program:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/capability.h>

void read_value(FILE *fp)
{
  int value;
  rewind(fp);
  if (fscanf(fp, "%d", &value) != 1) {
    fprintf(stderr, "read failed: %s\n", ferror(fp) ? strerror(errno) : "cannot parse");
  }
  else {
    fprintf(stderr, "oom_score_adj value: %d\n", value);
  }
}

void write_value(FILE *fp)
{
  int result;
  rewind(fp);
  /* Note: stdio buffers this write, so a kernel-side failure may only
     be reported later, at fflush() or fclose(). */
  result = fprintf(fp, "-1000");
  if (result < 0) {
    fprintf(stderr, "write failed: %s\n", strerror(errno));
  }
  else {
    fprintf(stderr, "wrote %d bytes\n", result);
  }
}

int main(void)
{
  FILE *fp;

  struct __user_cap_header_struct h;
  /* _LINUX_CAPABILITY_VERSION_3 transfers TWO 32-bit data elements;
     passing a single struct makes capget() write past the buffer. */
  struct __user_cap_data_struct d[2];

  h.version = _LINUX_CAPABILITY_VERSION_3;
  h.pid = 0;
  if (0 != capget(&h, d)) {
      fprintf(stderr, "capget failed: %s\n", strerror(errno));
  }
  else {
      /* CAP_SYS_RESOURCE is bit 24, so it lives in the first element. */
      fprintf(stderr, "CAP_SYS_RESOURCE: %s, %s, %s\n",
          d[0].effective & (1 << CAP_SYS_RESOURCE) ? "effective" : "not effective",
          d[0].permitted & (1 << CAP_SYS_RESOURCE) ? "permitted" : "not permitted",
          d[0].inheritable & (1 << CAP_SYS_RESOURCE) ? "inheritable" : "not inheritable");
  }

  fp = fopen("/proc/self/oom_score_adj", "r+");
  if (!fp) {
    fprintf(stderr, "failed to open /proc/self/oom_score_adj: %s\n", strerror(errno));
    return 1;
  }
  else {
    read_value(fp);
    write_value(fp);
    read_value(fp);
    fclose(fp);
  }
  return 0;
}
asked Jun 14 '18 by legoscia

2 Answers

This one was very interesting to crack, took me a while.

The first real hint was this answer to a different question: https://unix.stackexchange.com/questions/364568/how-to-read-the-proc-pid-fd-directory-of-a-process-which-has-a-linux-capabil - just to give credit where it's due.

The reason it does not work as is

The real reason you get "permission denied" there is that files under /proc/self/ are owned by root when the process has any capabilities - it's not about CAP_SYS_RESOURCE or about the oom_* files specifically. You can verify this by calling stat and granting the binary different capabilities. Quoting man 5 proc:

/proc/[pid]

There is a numerical subdirectory for each running process; the subdirectory is named by the process ID.

Each /proc/[pid] subdirectory contains the pseudo-files and directories described below. These files are normally owned by the effective user and effective group ID of the process. However, as a security measure, the ownership is made root:root if the process's "dumpable" attribute is set to a value other than 1. This attribute may change for the following reasons:

  • The attribute was explicitly set via the prctl(2) PR_SET_DUMPABLE operation.

  • The attribute was reset to the value in the file /proc/sys/fs/suid_dumpable (described below), for the reasons described in prctl(2).

Resetting the "dumpable" attribute to 1 reverts the ownership of the /proc/[pid]/* files to the process's real UID and real GID.

This already hints to the solution, but first let's dig a little deeper and see that man prctl:

PR_SET_DUMPABLE (since Linux 2.3.20)

Set the state of the "dumpable" flag, which determines whether core dumps are produced for the calling process upon delivery of a signal whose default behavior is to produce a core dump.

In kernels up to and including 2.6.12, arg2 must be either 0 (SUID_DUMP_DISABLE, process is not dumpable) or 1 (SUID_DUMP_USER, process is dumpable). Between kernels 2.6.13 and 2.6.17, the value 2 was also permitted, which caused any binary which normally would not be dumped to be dumped readable by root only; for security reasons, this feature has been removed. (See also the description of /proc/sys/fs/suid_dumpable in proc(5).)

Normally, this flag is set to 1. However, it is reset to the current value contained in the file /proc/sys/fs/suid_dumpable (which by default has the value 0), in the following circumstances:

  • The process's effective user or group ID is changed.

  • The process's filesystem user or group ID is changed (see credentials(7)).

  • The process executes (execve(2)) a set-user-ID or set-group-ID program, resulting in a change of either the effective user ID or the effective group ID.

  • The process executes (execve(2)) a program that has file capabilities (see capabilities(7)), but only if the permitted capabilities gained exceed those already permitted for the process.

Processes that are not dumpable can not be attached via ptrace(2) PTRACE_ATTACH; see ptrace(2) for further details.

If a process is not dumpable, the ownership of files in the process's /proc/[pid] directory is affected as described in proc(5).

Now it's clear: our process has a capability that the shell used to launch it did not have, so the kernel cleared the dumpable attribute, and the files under /proc/self/ are therefore owned by root rather than the current user.

How to make it work

The fix is as simple as setting that dumpable attribute back before trying to open the file. Put the following (it needs <sys/prctl.h>), or something similar, before the fopen() call:

prctl(PR_SET_DUMPABLE, 1, 0, 0, 0);

Hope that helps ;)

answered Oct 13 '22 by dvk

This is not an answer (dvk already provided the answer to the stated question), but an extended comment describing the often overlooked, and possibly very dangerous, side effects of reducing /proc/self/oom_score_adj.

In summary, using prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) will allow a process with CAP_SYS_RESOURCE capability (conveyed via e.g. filesystem capabilities) to modify the oom_score_adj of any other process owned by the same user, including their own.

(By default, a process that has capabilities is not dumpable, so a core dump is not generated even when the process is killed by a signal whose disposition is to generate a core.)

The dangers I'd like to comment on, are how the oom_score_adj range is inherited, and what it means to change it for processes that create child processes. (Thanks to dvk for some corrections.)


The Linux kernel maintains an internal value, oom_score_adj_min, for each process. The user (or the process itself) can modify the oom_score_adj to any value between oom_score_adj_min and OOM_SCORE_ADJ_MAX. The higher the value, the more likely the process is to be killed.

When a process is created, it will inherit its oom_score_adj_min from its parent. The original parent of all processes, init, has an initial oom_score_adj_min of 0.

To reduce the oom_score_adj below oom_score_adj_min, a process that either has superuser privileges, or has the CAP_SYS_RESOURCE and is dumpable, writes the new score to /proc/PID/oom_score_adj. In this case, oom_score_adj_min is also set to the same value.

(You can verify this by examining fs/proc/base.c:__set_oom_adj() in the Linux kernel; see the assignments to task->signal->oom_score_adj_min.)

The problem is that the oom_score_adj_min value sticks, except when updated by a process that has the CAP_SYS_RESOURCE capability. (Note: I originally thought it could not be raised at all, but I was wrong.)

For example, if you have a high-value service daemon that has its oom_score_adj_min reduced, running without the CAP_SYS_RESOURCE capability, increasing the oom_score_adj before forking child processes will cause the child processes to inherit the new oom_score_adj, but the original oom_score_adj_min. This means such child processes can reduce their oom_score_adj to that of their parent service daemon, without any privileges or capabilities.

Because there are only two thousand and one possible oom_score_adj values (-1000 to 1000, inclusive), and only the thousand negative ones make a process less likely to be killed than the default of zero, a nefarious process needs only ten or eleven writes to /proc/self/oom_score_adj to make the OOM killer avoid it as much as possible, using a binary search. First, it tries -500: if that succeeds, oom_score_adj_min lies between -1000 and -500; if it fails, between -499 and 1000. By halving the range at each attempt, it can set oom_score_adj to the kernel-internal minimum for that process, oom_score_adj_min, in ten or eleven writes, depending on the initial oom_score_adj value.


Of course, there are mitigations and strategies to avoid the inheritance problem.

For example, if you have an important process that the OOM killer should leave alone, and that should not create child processes, you should run it under a dedicated user account that has RLIMIT_NPROC set to a suitably small value.

If you have a service that creates new child processes, but you want the parent to be less likely to be OOM killed than other processes, and you do not want the children to inherit that, there are two approaches that work.

  1. Your service can, at startup, fork a dedicated child process whose job is to create the further child processes, before the service lowers its own oom_score_adj. This makes the child processes inherit their oom_score_adj_min (and oom_score_adj) from the process that started the service.

  2. Your service can keep CAP_SYS_RESOURCE in the CAP_PERMITTED set, but add or remove it from the CAP_EFFECTIVE set as needed.

    When the CAP_SYS_RESOURCE is in the CAP_EFFECTIVE set, adjusting oom_score_adj also sets the oom_score_adj_min to that same value.

    When CAP_SYS_RESOURCE is not in the CAP_EFFECTIVE set, you cannot lower oom_score_adj below the corresponding oom_score_adj_min, and oom_score_adj_min stays unchanged even when oom_score_adj is modified.

It does make sense to put work that can be canceled or killed in an OOM situation into child processes with higher oom_score_adj values. If an OOM situation does occur (for example, on an embedded appliance), the core service daemon has a much better chance of surviving, even when the worker child processes are killed. Of course, the core daemon itself should not allocate dynamic memory in response to client requests, as any bug in it may not just crash the daemon but bring the entire system to a halt (in an OOM situation where basically everything except the original cause, the core daemon, gets killed).

answered Oct 13 '22 by Nominal Animal