fork() leaking? Taking longer and longer to fork a simple process

I have a system in which two identical processes are run (let's call them replicas). When signaled, a replica will duplicate itself by using the fork() call. A third process selects one of the processes to kill randomly, and then signals the other to create a replacement. Functionally, the system works well; it can kill / respawn replicas all day except for the performance issue.

The fork() call is taking longer and longer. The following is the simplest setup that still displays the problem. The timing is shown in the graph below: [image: fork timing]

The replica's code is the following:

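#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

// timestamp_t and generate_timestamp() are helpers that are not shown here.
void restartHandler(int signo) {
  // Take a timestamp, then fork; the child prints how long the fork took.
  timestamp_t last = generate_timestamp();
  pid_t currentPID = fork();
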
  if (currentPID >= 0) { // Successful fork
    if (currentPID == 0) { // Child process
      timestamp_t current = generate_timestamp();
      printf("%lld\n", current - last);

      // unblock the signal
      sigset_t signal_set;
      sigemptyset(&signal_set);
      sigaddset(&signal_set, SIGUSR1);
      sigprocmask(SIG_UNBLOCK, &signal_set, NULL);

      return;
    } else {   // Parent: reap any already-exited child without blocking, then return
      waitpid(-1, NULL, WNOHANG);
      return;
    }
  } else {
    printf("Fork error!\n");
    return;
  }
}

int main(int argc, const char **argv) {
  if (signal(SIGUSR1, restartHandler) == SIG_ERR) {
    perror("Failed to register the restart handler");
    return -1;
  }

  while(1) {
    sleep(1);
  }

  return 0;
}

The longer the system runs, the worse it gets.

Sorry to lack a specific question, but does anyone have any ideas or clues as to what is going on? It seems to me that there is a resource leak in the kernel (thus the linux-kernel tag), but I don't know where to start looking.

What I have tried:

  • Tried kmemleak, which did not catch anything. This implies that if there is a memory "leak", the memory is still reachable.
  • /proc/<pid>/maps is not growing.
  • Currently running the 3.14 kernel with RT patch (note this happens with non-rt and rt processes), and have also tried on 3.2.
  • Zombie processes are not an issue. I have tried a version in which I set up another process as a subreaper using prctl().
  • I first noticed this slowdown in a system in which the timing measurements are done outside of the restarted process; same behavior.

Any hints? Anything I can provide to help? Thanks!

asked Dec 08 '14 23:12 by superdesk


2 Answers

The slowdown is caused by an accumulation of anonymous VMAs (anon_vma structures) and is a known problem. It becomes evident when there are a large number of fork() calls and each parent exits before its children. The following code recreates the problem (source: Daniel Forrest):

#include <unistd.h>

int main(int argc, char *argv[])
{
  pid_t pid;
  while (1) {
    pid = fork();
    if (pid == -1) {
      /* error */
      return 1;
    }
    if (pid) {
      /* parent */
      sleep(2);
      break;
    }
    else {
      /* child */
      sleep(1);
    }
  }
  return 0;
}

The behavior can be confirmed by checking anon_vma in /proc/slabinfo.

There is a patch (source) which limits the length of copied anon_vma_chain to five. I can confirm that the patch fixes the problem.

As for how I eventually found the problem: I started putting printk calls throughout the fork code and checking the times shown in dmesg. Eventually I saw that it was the call to anon_vma_fork that was taking longer and longer. From there, a quick Google search turned up the known issue.

It took a rather long time, so I would still appreciate any suggestions for a better way to have tracked down the problem. And to everyone who already spent time trying to assist me, thank you.

answered Nov 15 '22 19:11 by superdesk


Maybe you could try using the generic wait() call rather than waitpid()? It's just a guess, but a professor in undergrad told me it was better. Also, have you tried using AddressSanitizer?

You can also use GDB to debug a child process (if you haven't already tried that), using follow-fork-mode:

set follow-fork-mode child

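With this setting GDB follows the child after a fork; with the default (parent) it stays with the parent. Either way, only one of the two processes is debugged. You can debug both by having the child call sleep() right after forking, getting the child's pid, and then running:

attach <child process pid>

When you are done with the child, call: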

detach

Attaching with GDB is also useful together with Valgrind, because you can search for memory leaks from the debugger. Start valgrind with:

valgrind --vgdb-error=0...<executable>

then connect GDB to it (target remote | vgdb), set some relevant breakpoints, continue through your program until you hit them, and search for leaks:

monitor leak_check full reachable any

then:

monitor block_list <loss_record_nr>

answered Nov 15 '22 18:11 by J_COL