Fork and core dump with threads

Tags:

Similar points to the one in this question have been raised before here and here, and I'm aware of the Google coredump library (which I've appraised and found lacking, though I might try and work on that if I understand the problem better).

I want to obtain a core dump of a running Linux process without interrupting the process. The natural approach is to say:

if (!fork()) { abort(); }

Since the forked process gets a fixed snapshot copy of the original process's memory, I should get a complete core dump, and since the copy uses copy-on-write, it should generally be cheap. However, a critical shortcoming of this approach is that fork() only forks the current thread, and all other threads of the original process won't exist in the forked copy.

My question is whether it is possible to somehow obtain the relevant data of the other, original threads. I'm not entirely sure how to approach this problem, but here are a couple of sub-questions I've come up with:

Is the memory that contains all of the threads' stacks still available and accessible in the forked process?
Is it possible to (quicky) enumerate all the running threads in the original process and store the addresses of the bases of their stacks? As I understand it, the base of a thread stack on Linux contains a pointer to the kernel's thread bookkeeping data, so...
with the stored thread base addresses, could you read out the relevant data for each of the original threads in the forked process?

If that is possible, perhaps it would only be a matter of appending the data of the other threads to the core dump. However, if that data is lost at the point of the fork already, then there doesn't seem to be any hope for this approach.

220

asked Aug 28 '13 12:08

Kerrek SB

2 Answers

Are you familiar with process checkpoint-restart? In particular, CRIU? It seems to me it might provide an easy option for you.

I want to obtain a core dump of a running Linux process without interrupting the process [and] to somehow obtain the relevant data of the other, original threads.

Forget about not interrupting the process. If you think about it, a core dump has to interrupt the process for the duration of the dump; your true goal must therefore be to minimize the duration of this interruption. Your original idea of using fork() does interrupt the process, it just does so for a very short time.

Is the memory that contains all of the threads' stacks still available and accessible in the forked process?

No. The fork() only retains the thread that does the actual call, and the stacks for the rest of the threads are lost.

Here is the procedure I'd use, assuming CRIU is unsuitable:

Have a parent process that generates a core dump of the child process whenever the child is stopped. (Note that more than one consecutive stop event may be generated; only the first one until the next continue event should be acted on.)

You can detect the stop/continue events using waitpid(child,,WUNTRACED|WCONTINUED).
Optional: Use sched_setaffinity() to restrict the process to a single CPU, and sched_setscheduler() (and perhaps sched_setparam()) to drop the process priority to IDLE.

You can do this from the parent process, which only needs the CAP_SYS_NICE capability (which you can give it using setcap 'cap_sys_nice=pe' parent-binary to the parent binary, if you have filesystem capabilities enabled like most current Linux distributions do), in both the effective and permitted sets.

The intent is to minimize the progress of the other threads between the moment a thread decides it wants a snapshot/dump, and the moment when all threads have been stopped. I have not tested how long it takes for the changes to take effect -- certainly they only happen at the end of their current timeslices at the very earliest. So, this step should probably be done a bit beforehand.

Personally, I don't bother. On my four-core machine, the following SIGSTOP alone yields similar latencies between threads as a mutex or a semaphore does, so I don't see any need to strive for even better synchronization.
When a thread in the child process decides it wants to take a snapshot of itself, it sends a SIGSTOP to itself (via kill(getpid(), SIGSTOP)). This stops all threads in the process.

The parent process will receive the notification that the child was stopped. It will first examines /proc/PID/task/ to obtain the TIDs for each thread of the child process (and perhaps /proc/PID/task/TID/ pseudofiles for other information), then attaches to each TID using ptrace(PTRACE_ATTACH, TID). Obviously, ptrace(PTRACE_GETREGS, TID, ...) will obtain the per-thread register states, which can be used in conjunction with /proc/PID/task/TID/smaps and /proc/PID/task/TID/mem to obtain the per-thread stack trace, and whatever other information you're interested in. (For example, you could create a debugger-compatible core file for each thread.)

When the parent process is done grabbing the dump, it lets the child process continue. I believe you need to send a separate SIGCONT signal to let the entire child process continue, instead of just relying on ptrace(PTRACE_CONT, TID), but I haven't checked this; do verify this, please.

I do believe that the above will yield a minimal delay in wall clock time between the threads in the process stopping. Quick tests on AMD Athlon II X4 640 on Xubuntu and a 3.8.0-29-generic kernel indicates tight loops incrementing a volatile variable in the other threads only advance the counters by a few thousand, depending on the number of threads (there's too much noise in the few tests I made to say anything more specific).

Limiting the process to a single CPU, and even to IDLE priority, will drastically reduce that delay even further. CAP_SYS_NICE capability allows the parent to not only reduce the priority of the child process, but also lift the priority back to original levels; filesystem capabilities mean the parent process does not even have to be setuid, as CAP_SYS_NICE alone suffices. (I think it'd be safe enough -- with some good checks in the parent program -- to be installed in e.g. university computers, where students are quite active in finding interesting ways to exploit the installed programs.)

It is possible to create a kernel patch (or module) that provides a boosted kill(getpid(), SIGSTOP) that also tries to kick off the other threads from running CPUs, and thus try to make the delay between the threads stopping even smaller. Personally, I wouldn't bother. Even without the CPU/priority manipulation I get sufficient synchronization (small enough delays between the times the threads are stopped).

Do you need some example code to illustrate my ideas above?

174

answered Oct 09 '22 00:10

Nominal Animal

When you fork you get a full copy of the running processes memory. This includes all thread's stacks (after all you could have valid pointers into them). But only the calling thread continues to execute in the child.

You can easily test this. Make a multithreaded program and run:

pid_t parent_pid = getpid();

if (!fork()) {
    kill(parent_pid, SIGSTOP);

    char buffer[0x1000];

    pid_t child_pid = getpid();
    sprintf(buffer, "diff /proc/%d/maps /proc/%d/maps", parent_pid, child_pid);

    system(buffer);

    kill(parent_pid, SIGTERM);

    return 0;
} else for (;;);

So all your memory is there and when you create a core dump it will contain all the other threads stacks (provided your maximum core file size permits it). The only pieces that will be missing are their register sets. If you need those then you will have to ptrace your parent to obtain them.

You should keep in mind though that core dumps are not designed to contain runtime information of more then one thread - the one that caused the core dump.

To answer some of your other questions:

You can enumerate threads by going through /proc/[pid]/tasks, but you can not identify their stack bases until you ptrace them.

Yes, you have full access to the other threads stacks snapshots (see above) from the forked process. It is not trivial to determine them, but they do get put into a core dump provided the core file size permits it. Your best bet is to save them in some globally accessible structure if you can upon creation.

answered Oct 09 '22 00:10

Sergey L.

Related questions
                            
                                How to use the condensation algorithm available in OpenCV?
                            
                                Built-in functions in c++
                            
                                How to use boost::serialization between x86 and x64 platforms
                            
                                Culling obstructed points in a point cloud
                            
                                How can I Send & Receive SMS from QT mobile application for symbian?
                            
                                Using static functions of a base class without specifying parameters to avoid ambiguity
                            
                                Barton-Nackman vs std::enable_if
                            
                                How to resolve ambiguous overloaded function call?
                            
                                Boost IO Stream and ZLib speed up
                            
                                Embedding assembler within C++ acceptable?
                            
                                C++ type erasure with template copy and move constructor
                            
                                Clear And Efficient 3D Range Tree Implementation
                            
                                C++ Performance: template vs boost.any
                            
                                Two different results on GCC 4.6 and 4.7 for template template deduction
                            
                                Qt Creator error code -1073741819
                            
                                Deducing template argument for nested template fails
                            
                                Could I wrap c++ library to Adobe Air native extension for mobile IOS/Android
                            
                                Incomplete type is not allowed in a class, but is allowed in a class template
                            
                                Why is a unique_ptr not freed after a constructor calls an exception?
                            
                                Which array/list should I use?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fork and core dump with threads

Tags:

c++

c

linux

pthreads

coredump

Kerrek SB

People also ask

2 Answers

Nominal Animal

Sergey L.

Recent Activity

Donate For Us