Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are PID-files still flawed when doing it 'right'?

Tags:

linux

bash

Restarting a service is often implemented via a PID file - I.e. the process ID is written to some file and based on that number the stop command will kill the process (or before a restart).

When you think about it (or if you don't like this, then search) you'll find that this is problematic as every PID could be reused. Imagine a complete server restart where you call './your-script.sh start' at startup (e.g. @reboot in crontab). Now your-script.sh will kill an arbitrary PID because it has stored the PID from the live before the restart.

One workaround I can imagine is to store an additional information, so that you could do 'ps -pid | grep ' and only if this returns something you kill it. Or are there better options in terms of reliability and/or simplicity?

#!/bin/bash

function start() {
  nohub java -jar somejar.jar >> file.log 2>&1 &
  PID=$!
  # one could even store the "ps -$PID" information but this makes the
  # killing too specific e.g. if some arguments will be added or similar
  echo "$PID somejar.jar" > $PID_FILE
}

function stop() {
  if [[ -f "$PID_FILE" ]]; then
    PID=$(cut -f1 -d' ' $PID_FILE)
    # now get the second information and grep the process list with this
    PID_INFO=$(cut -f2 -d' ' $PID_FILE)
    RES=$(ps -$PID | grep $PID_INFO)
    if [[ "x$RES" != "x" ]]; then
       kill $PID
    fi
  fi
}
like image 529
Karussell Avatar asked Sep 18 '14 06:09

Karussell


1 Answers

The problem with PID files is multifold, not just limited to recycling and reboot.

The bigger issue is the fact that there is an unavoidable disconnect/race between the information in the PID file and the state of the process.

This is the flow of using PID files:

  1. You fork & exec a process. The "parent" process knows the PID of the fork and has guarantees that this PID is reserved exclusively for his fork.
  2. Your parent writes the PID of the fork to a file.
  3. Your parent dies, along with it the guarantee about PID exclusivity.
  4. A different process reads the number in the PID file.
  5. The different process checks whether there is a process on the system with the same PID as the one he read.
  6. The different process sends a signal to the process with the PID he read.

In (1) everything is fine and dandy. We have a PID and we are guaranteed by the kernel that the number is reserved for our intended process.

In (2) you are yielding control of the PID to other processes that do not have this guarantee. In itself not an issue, but such an act is rarely if ever without fault.

In (3) your parent process dies. It alone had the kernel guarantee on PID exclusivity. It may or may not have done a wait(2) on the PID. The true status of the intended process is lost, all we have left is an identifier in the PID file which may or may not refer to the intended process.

In (4) a process without any guarantees reads the PID file, any use of this number has only arbitrary success.

In (5) a process without any guarantees actually uses the identifier for something, this is the first point where we're actually doing something bad: we're querying the kernel using a process identifier that may or may not refer to the intended process. The answer we'll get back will be on the state of the process with that PID, not necessarily of our intended process at all.

In (6) we make the worst mistake: we're actually performing a mutating action, intended to impact our initially started process but by no means guaranteeing that intent. We could be signalling any random system process instead.

Why is this? What kind of stuff can happen to mess with the PID?

Anywhere after (1), the real process may die. So long as the parent retains his guarantee on the PID's exclusivity, the kernel will not recycle the PID. It will still exist and refer to what used to be your process (we call this a "zombie" process, your real process died but the PID is still reserved for it alone). No other process can use this PID and signalling it will not reach any process at all.

As soon as the parent releases his guarantee or after (3), the kernel recycles the PID of the dead process. The zombie is gone and the PID now free to be used by any other new process that is forked. Say you're compiling something, thousands of small processes get spawned. The kernel picks random or sequential (depending on its configuration) new PIDs for each. You're done, now you restart apache. The kernel reuses the freed PID of your dead process for something important.

The PID file still contains the PID, though. Any process that reads the PID file (4) is assuming that this number refers to your long dead process.

Any action (5) (6) you take with the number you read will target the new process, not the old one.

Not only that, but you cannot perform any check prior to your action since there is an unavoidable race between any check you can perform and any action you can perform. If you first look at ps to see what the "name" of your process is (not that this is a really awesome guarantee of anything, please don't do this), and then signal it, the time between your ps check and your signal could still have seen the process die, and/or get recycled by a new process. The root of all of these problems is that the kernel is not giving you any exclusive use guarantees on the PID, since you are not its parent.

Moral of the story: Do NOT give the PID of your children to anyone else. The parent and only the parent should use it, because he is the only one on the system (save the kernel) with any guarantees on its existence and identity.

This usually means keeping the parent alive and instead of signalling something to terminate the process, talking to the parent instead; by means of sockets or the like. See http://smarden.org/runit/ et al.

like image 145
lhunath Avatar answered Oct 10 '22 21:10

lhunath